9. TESTING INDIVIDUALS OF DIVERSE 
LINGUISTIC BACKGROUNDS 


Background 

For all test takers, any test that employs language 
is, in part, a measure of their language 
skills. This is of particular concern for test 
takers whose first language is not the lan¬ 
guage of the test. Test use with individuals 
who have not sufficiently acquired the lan¬ 
guage of the test may introduce construct- 
irrelevant components to the testing process. 
In such instances, test results may not reflect 
accurately the qualities and competencies 
intended to be measured. In addition, lan¬ 
guage differences are almost always associated 
with concomitant cultural differences that need 
to be taken into account when tests are used 
with individuals whose dominant language 
is different from that of the test. Whether 
a certain dialect of a language should be 
considered a different language cannot be 
resolved here, although some aspects of 
the present discussion are relevant to the 
debate. In either case, special attention to 
issues related to language and culture may 
be needed when developing, administering, 
scoring, and interpreting test scores and mak¬ 
ing decisions based on test scores. Language 
proficiency tests, if appropriately designed 
and used, are an obvious exception to this 
concern because they are intended to measure 
familiarity with the language of the test 
as required in educational and other settings. 

Individuals who are bilingual can vary 
considerably in their ability to speak, write, 
comprehend aurally, and read in each lan¬ 
guage. These abilities are affected by the 
social or functional situations of communica¬ 
tion. Some people develop socially and cul¬ 
turally acceptable ways of speaking that 
combine two or more languages simultane¬ 
ously. Other individuals familiar with two 
languages may perform more slowly, less efficiently, 
and at times less accurately on 
problem-solving tasks that are administered in 
the less familiar language. Language domi¬ 
nance is not necessarily an indicator of lan¬ 
guage competence in taking a test, and some 
accommodation may be necessary even when 
administering the test in the more familiar 
language. Therefore it is important to consid¬ 
er language background in developing, select¬ 
ing, and administering tests and in interpreting 
test performance. Consequently, for example, 
test norms based on native speakers of English 
either should not be used with individuals 
whose first language is not English or such 
individuals’ test results should be interpreted 
as reflecting in part current level of English 
proficiency rather than ability, potential, aptitude 
or personality characteristics or symptomatology. 
In cases where a language-oriented 
test is inappropriate due to the test takers’ 
limited proficiency in that language, a non¬ 
verbal test may be a suitable alternative. 

Where effective job performance requires 
the ability to communicate in the language of 
the test, persons who do not have adequate 
proficiency in that language may perform poor¬ 
ly on the test, on the job, or both. In that case, 
the tests used for prediction of future job per¬ 
formance appropriately would be administered 
in the language of the job, as long as the lan¬ 
guage level needed for the test did not exceed 
the level needed to meet work requirements. 
Test users should understand that poor test 
performance, as well as poor job performance, 
may result from poor language proficiency 
rather than other deficiencies. 

Many issues addressed in this chapter are 
also relevant to testing individuals who have 
unique linguistic characteristics due to dis¬ 
abilities such as deafness and/or blindness. 
For example, issues regarding test translation 
and adaptation are applicable to American 
Sign Language (ASL) versions of traditional 
tests. It should be noted, however, that ASL is 
not only a different language but is also a 
different mode of communication. Also, indi¬ 
viduals with disabilities may require modifica¬ 
tions in test administration procedures similar 
to those required by non-native speakers. A 
more specific discussion of testing individuals 
with disabilities is provided in chapter 10. 

Issues discussed in earlier chapters, in 
particular chapters 1-5, including validity of 
test score inferences, test reliability, and test 
development and administration are germane 
to this chapter. The present chapter extends 
these discussions, emphasizing the impor¬ 
tance of recognizing the possible impact of 
language abilities and skills on test perform¬ 
ance. There may be legal requirements relevant 
to the testing of individuals with different lan¬ 
guage backgrounds. The standards in this 
chapter are intended to be applied in a manner 
consistent with those requirements. 

Test Translation, Adaptation, and 
Modification 

Testing test takers in their primary language 
may be necessary in order to draw valid infer¬ 
ences based on their test scores. Thus, language 
modifications are often needed. Translating a 
test to the primary language represents one 
such modification. However, a number of 
hazards need to be avoided when doing this 
sort of translation. One cannot simply 
assume that such a translation produces a ver¬ 
sion of the test that is equivalent in content, 
difficulty level, reliability, and validity to the 
original untranslated version. Further, one 
cannot assume that test takers’ relevant accul¬ 
turation experiences are comparable across 
the two versions. Also, many words have different 
frequency rates or difficulty levels in 
various languages. Therefore, words in two 
languages that appear to be close in meaning 
may differ significantly in ways that seriously 
impact the translated test for the intended 
test use. Additionally, the test content of the 
translated version may not be equivalent to 
that of the original version. For example, a 
test of reading skills in language A that is 
translated to serve as a test of reading skills in 
language B may include content not equally 
meaningful or appropriate for people who 
read only language B. 

For the purposes of test translation and 
adaptation for use with test takers whose first 
language is not the language of the test, back 
translation is not recommended as a stand¬ 
alone procedure. It may provide an artificial 
similarity of meaning across languages but not 
the best version in the new language. In most 
situations, an iterative process more akin to test 
development and validation is suggested to 
ensure that similar constructs are measured 
across versions. When test forms in two or 
more languages are developed concurrently, it 
is generally desirable that some items originate 
in each of the languages involved. The decision 
as to whether to use the standard original lan¬ 
guage test or an adapted version is a complex 
matter. Issues that may have an impact on this 
decision are discussed in the next section. 

Other strategies of test modification may 
be appropriate when the test taker's primary 
language is not the language of the test. These 
include modifying aspects of the test or the 
test administration procedure such as the 
presentation format, the response format, the 
time allowed to complete the test, the test 
setting (individual administration instead of 
group testing), and the use of only those por¬ 
tions of the test that are appropriate for the 
level of language proficiency of the test taker. 
If modifications are made to the presentation 
or response format of the test, it may sometimes 
be appropriate for the modified test to be 
field tested with an adequate population sam¬ 
ple prior to use with its intended population. 

Issues of Equivalence 

The term equivalence, as used here, refers to 
the degree to which test scores can be used 
to make comparable inferences for different 
examinees. When tests are designed for and 
used with linguistically homogeneous popu¬ 
lations, issues of equivalence are relatively 
straightforward (for example, see chapters 
1 and 4). If an individual examinee can be 
demonstrated to belong to the population 
for which the test was designed, then adher¬ 
ing to standard procedures of test adminis¬ 
tration and interpretation is expected to 
lead to reliable and valid inferences based 
on the examinee’s test score. When a test is 
intended for use with test takers who differ 
linguistically from those for whom the test 
was designed, establishing equivalence poses 
a greater challenge. In general, the linguistic 
and cultural characteristics of the intended 
examinee population should be reflected 
in examinee samples used throughout the 
processes of test design, validation, and 
norming. At each of these stages of test 
development and standardization, distinct 
linguistic groups should receive the same 
level of specific attention. The inclusion of 
proportional representation of linguistic 
subgroups in aggregate standardization and 
validation samples may be insufficient to 
assure equivalence across linguistic groups. 

Issues associated with construct equiva¬ 
lence are perhaps most fundamental. One 
may question whether the test score for a 
particular individual represents that individual's 
standing with respect to the same construct 
as is measured in the target population. 
For example, among non-native speakers 
of the language of the test, one may not 
know whether a test designed to measure 
primarily academic achievement becomes in 
whole or in part a measure of proficiency in 
the language of the test. There are several 
psychometric techniques that can be used 
to determine the equivalence of constructs 
across groups, including confirmatory factor 
analysis, analysis of data contained in multi- 
method-multitrait matrices and the equiva¬ 
lence of responsiveness of the groups to 
experimental manipulations. These techniques 
may be supplemented with logical 
analyses of the results based on knowledge 
of the linguistic characteristics of the test 
taker’s population of origin. 

Other types of equivalence also need to 
be considered when testing individuals from 
different linguistic backgrounds. Functional 
equivalence addresses the question of whether 
similar activities or behaviors measured by a 
test have the same meaning in different cul¬ 
tural or linguistic groups. Translation equiva¬ 
lence requires that the translated or adapted 
test be comparable in content to the original 
test; it was addressed above in the discussion 
of test translation and adaptation. Metric 
equivalence concerns the issue of whether 
scores from the same test administered in dif¬ 
ferent languages have comparable psycho¬ 
metric properties. For example, with metric 
equivalence, a score of 50 on test X in lan¬ 
guage A is interpretable in the same way as a 
score of 50 on test X in language B. In gener¬ 
al, metric equivalence will be limited to par¬ 
ticular contexts, examinee groups, and types 
of interpretations. 

Language Proficiency Testing 

Consideration of relevant within-linguistic 
group differences is crucial in determining 
appropriate test interpretation and decision 
making in educational programs and in some 
professional applications of individualized 
tests. For example, individuals whose first 
language is not the language of the test may 
vary considerably in their proficiency along a 
continuum from those who have no knowl¬ 
edge of the language of the test to those who 
are fluent in it and knowledgeable of the cor¬ 
responding culture. Further, a demographic 
proxy such as Mexican or German is likely to 
prove insufficient in determining the lan¬ 
guage of test administration because members 
of the same cultural group may vary widely in 
their degree of acculturation, proficiency in 
the language of the test, familiarity with 
words and syntax in their native languages, 
educational background, familiarity with tests 
and test-taking skills, and other factors that 
may significantly affect the reliability and 
validity of inferences drawn from test scores. 
Thus, it is essential that individual differences 
that may affect test performance be taken 
into account when testing individuals of 
differing linguistic backgrounds. 

The need exists to consider both lan¬ 
guage dominance and language proficiency. 
Standardized tests that assess multiple 
domains in a given language can be helpful 
in determining language dominance and 
proficiency. The person conducting the test¬ 
ing first should obtain information about 
the language in which the examinee is 
dominant (i.e., the preferred or salient lan¬ 
guage). Following this determination of 
dominance, the examinee’s level of profi¬ 
ciency in the dominant language should be 
established. If the languages are similarly 
dominant, then proficiency should be estab¬ 
lished for both (or all) languages. Then the 
test should be administered in the most 
proficient language if available (unless the 
purpose of the testing is to determine profi¬ 
ciency in the language of the test). However, 
testing individuals in their dominant lan¬ 
guage alone is no panacea because, as sug¬ 
gested above, a bilingual individual’s two 
languages are likely to be specialized by 
domain (e.g., the first language is used in 
the context of home, religious practices, 
and native culture, whereas the second lan¬ 
guage is used in the context of school, 
work, television, and mainstream culture). 
Thus, a test in either language by itself will 
likely measure some domains and miss out 
on others. In such situations, testing in 
both languages (i.e., the dominant language 
and the language in which the test taker is 
most proficient) may be necessary, provided 
appropriate tests are available. If assessment 
in both languages is carried out, careful 
consideration should be given to the possi¬ 
bility of order effects. 
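
The sequence of determinations described above can be sketched as a small decision routine. This is an illustrative sketch only, not a procedure prescribed by the standards: the names LanguageProfile and plan_administration, and the numeric 0-1 proficiency scale, are assumptions made for the example, and actual decisions call for professional judgment.

```python
from dataclasses import dataclass
from typing import Dict, List, Set


@dataclass
class LanguageProfile:
    """Hypothetical record of one examinee's language background."""
    dominant: List[str]            # one or more similarly dominant languages
    proficiency: Dict[str, float]  # language -> proficiency (assumed 0-1 scale)


def plan_administration(profile: LanguageProfile,
                        available_versions: Set[str],
                        test_language: str,
                        measures_proficiency: bool = False) -> List[str]:
    """Return the languages in which the test might be administered."""
    # A test whose purpose is to determine proficiency in the language
    # of the test is administered in that language.
    if measures_proficiency:
        return [test_language]

    # Otherwise start from the most proficient language for which a
    # version of the test is available.
    candidates = [lang for lang in profile.proficiency
                  if lang in available_versions]
    if not candidates:
        return []  # no suitable version; other assessment options are needed
    most_proficient = max(candidates, key=lambda lang: profile.proficiency[lang])

    plan = [most_proficient]
    # If a dominant language differs from the most proficient one and a
    # version exists, testing in both may be necessary.
    for lang in profile.dominant:
        if lang not in plan and lang in available_versions:
            plan.append(lang)
    return plan
```

When the returned plan contains more than one language, the order of administration still has to be chosen with possible order effects in mind; the sketch does not address that choice.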


Because students are expected to acquire 
proficiency in the language used in schools 
that is appropriate to their ages and educa¬ 
tional levels, tests suitable for assessing their 
progress in that language are needed. For 
example, some tests, especially paper-and- 
pencil measures, that are prepared for stu¬ 
dents of English as a foreign language may 
not be particularly useful if they place insuffi¬ 
cient emphasis on the assessment of impor¬ 
tant listening and speaking skills. Measures of 
competency in all relevant English language 
skills (e.g., communicative competence, liter¬ 
acy, grammar, pronunciation, and compre¬ 
hension) are likely to be most valuable in the 
school context. 

Observing students’ speech in naturalis¬ 
tic situations can provide additional informa¬ 
tion about their proficiency in a language. 
However, findings from naturalistic observa¬ 
tions may not be sufficient to judge students’ 
ability to function in that language in for¬ 
mal, academically oriented situations (e.g., 
classrooms). For example, it is not appropri¬ 
ate to base judgments of a child’s ability to 
benefit from instruction in one language 
solely on language fluency observed in speech 
use on the playground. Nor is it appropri¬ 
ate to base judgments of a person’s ability to 
perform a job on assessments of formal lan¬ 
guage usage, if formal language usage is 
not linked to job performance. 

In general, there are special difficulties 
attendant upon the use of a test with individ¬ 
uals who have not had an adequate opportu¬ 
nity to learn the language used by the test. 
When a test is used to inform a decision 
process that has a broad impact, it may be 
important for the test user to review the test 
itself and to consider the possible use of 
alternative information-gathering tools (e.g., 
additional tests, sources of observational 
information, modified forms of the chosen 
test) to ensure that the information obtained 
is adequate to the intended purpose. Reviews 
of this kind may sometimes reveal the need 
to create a formal adaptation of a test or to 
develop a new test that is suitable for the spe¬ 
cific linguistic characteristics of the individu¬ 
als being tested. 

Testing Bilingual Individuals 

Test use with examinees who are bilingual 
also poses special challenges. An individual 
who knows two languages may not test well 
in either language. As an example, children 
from homes where parents speak Spanish may 
be able to understand Spanish but express 
themselves best in English. In addition, some 
persons who are bilingual use their native 
language in most social situations and use 
English primarily for academic and work- 
related activities; the use of one or both 
languages depends on the nature of the sit¬ 
uation. As another example, proficiencies in 
conversational English and written English 
can often differ. Non-native English speakers 
who may give the impression of being fluent 
in conversational English may not be compe¬ 
tent in taking tests that require English litera¬ 
cy skills. Thus, an understanding of an 
individual's type and degree of bilingualism 
is important to proper test use. 

Administration and Examiner 
Variables 

When an examinee cannot be assumed to 
belong to the cultural or linguistic population 
upon which the test was standardized, then 
use of standardized administration procedures 
may not provide a comparable administration 
of the test for that examinee. In this situation, 
the fundamental principle of sound practice 
is that examinees, regardless of background, 
should be provided with an adequate oppor¬ 
tunity to complete the test and demonstrate 
their level of competence on the attributes 
the test is intended to measure. There may 
be, however, complex interactions among 
examiner, examinee, and situational variables 
that require careful attention on the part 
of the practitioner administering the test. 
Factors that may affect the performance of 
the examinee include the cultural and linguis¬ 
tic background of both the examiner and 
examinee; the gender and testing style of the 
examiner; the level of acculturation of the 
examinee and examiner; whether the test is 
administered in the original language of the 
test, the examinee's primary language, or 
whether both languages are used (and if so 
in what order); the time limits of the testing; 
and whether a bilingual interpreter is used. 

Use of Interpreters in Testing 

Ideally, when an adequately translated version 
of the test or a suitable nonverbal test is 
unavailable, assessment of individuals with 
limited proficiency in the language of the test 
should be conducted by a professionally 
trained bilingual examiner. The bilingual 
examiner should be proficient in the language 
of the examinee at the level of a professional 
trained in that language. When a bilingual 
examiner is not available, an alternative is to 
use an interpreter in the testing process and 
administer the test in the examinee’s native 
language. Although a commonly used proce¬ 
dure, this practice has some inherent difficul¬ 
ties. For example, there may be a lack of 
linguistic and cultural equivalence between 
the translation and the original test, the translator 
or the interpreter may not be adequately 
trained to work in the testing situation, and 
representative norms may not be available to 
score and interpret the test results appropri¬ 
ately. These difficulties may pose significant 
threats to the validity of inferences based on 
test results. 

When the need for an interpreter arises 
for a particular testing situation, it is impor¬ 
tant to obtain a fully qualified interpreter to 
assist the examiner in administering the test. 
The most important consideration in testing 
with the services of an interpreter is the interpreter's 
ability and preparedness in carrying 
out the required duties during testing. The 
interpreter obviously needs to be fluent in 
both the language of the test and the examinee's 
native language and have general familiarity 
with the process of translating. To be 
effective, the interpreter also needs to have a 
basic understanding of the process of psycho¬ 
logical and educational assessment, including 
the importance of following standardized procedures, 
the importance of accurately conveying 
to the examiner an examinee's actual 
responses, and the role and responsibilities of 
the interpreter in testing. Additionally, it is 
inappropriate for the interpreter to have any 
prior personal relationship with the test taker 
that is likely to jeopardize the objectivity of 
the test administration. However, in small 
linguistic or cultural communities, speakers 
of the alternate languages are often known to 
each other. Therefore, in such cases, it is the 
responsibility of the test user or examiner to 
ensure that the interpreter has received ade¬ 
quate instruction in the principles of objec¬ 
tive test administration and to assess 
preexisting biases so that test interpretations 
can take such factors into account. If clear 
biases are evident and cannot be ameliorated, 
then the examiner should make arrangements 
to obtain another interpreter. 

Whenever proficiency in the language of 
the test is essential to job performance, use of 
a translator to assist a candidate with licen¬ 
sure, certification, or civil service examina¬ 
tions should be permitted only when it will 
not compromise standards designed to pro¬ 
tect public health, safety, and welfare. When 
a translator is permitted, it also is essential 
that the candidate not receive help interpret¬ 
ing the content of the test or any other assis¬ 
tance that would compromise the integrity 
of the licensure or certification decision. 
Creation of audio tapes that enable a candidate 
to listen to each question being read in the 
language of the test may be more appropriate 
when such an accommodation is justified. 


In educational and psychological testing, 
it may be appropriate for an interpreter to 
become familiar with all details of test con¬ 
tent and administration prior to the testing. 
Also, time needs to be provided for the inter¬ 
preter to translate test instructions and items, 
if necessary. In psychological testing, it is 
often desirable for the examiner to demon¬ 
strate for the interpreter how certain test 
items are administered and explain what 
to expect during testing. In addition, it is 
important that, prior to the testing, the 
examiner and the interpreter become familiar 
with each other’s style of speaking and the 
speed at which they work. Immediately prior 
to the assessment, the role of the interpreter 
needs to be explained clearly to the examinee. 
It is essential that the interpreter make all 
efforts to provide accurate information in 
translation. The interpreter must reflect a 
professional attitude and maintain objectivity 
throughout the testing process (e.g., not 
interject subjective opinions, not give cues to 
the examinee). Once the testing is completed, 
the examiner is responsible for reviewing the 
test responses with the assistance of the inter¬ 
preter. Responses that are difficult to interpret 
(e.g., vocabulary words), nontest behaviors 
that might have special meanings (e.g., body 
language), as well as language factors (e.g., 
mixed use of two languages) and cultural fac¬ 
tors that might have an effect on testing 
results need to be discussed fully. This infor¬ 
mation is to be used then by the examiner in 
carefully evaluating the test results and draw¬ 
ing inferences from the results. 

Cultural Differences and Individual 
Testing 

Linguistic behavior that may appear eccentric 
or be judged to be less appropriate in one cul¬ 
ture may be seen as more appropriate in 
another culture and may need to be taken 
into account during the testing process. For 
example, children or adults from some cultures 
may be reluctant to speak in elaborate 
language to adults or people in higher status 
roles and instead may be encouraged to speak 
to such persons only in response to specific 
questions or with formulaic utterances. Thus, 
when tested, such test takers may respond to 
an examiner probing for elaborate speech 
with only short phrases or by shrugging their 
shoulders. Interpretations of scores resulting 
from such testing may prove to be inaccurate 
if this tendency is not properly taken into 
consideration. At the same time, the examiner 
should not presume that their reticence is 
necessarily a cultural characteristic. Additional 
information (e.g., prior observations or a 
family member's consultation) may be needed 
to discuss the extent of culture’s possible 
influence on linguistic performance. 

The values associated with the nature 
and degree of verbal output also may differ 
across cultures. One cultural group may judge 
verbosity or rapid speech as rude, whereas 
another may regard those speech patterns as 
indications of high mental ability or friendli¬ 
ness. An individual from one culture who is 
evaluated with values appropriate to another 
culture may be considered taciturn, with¬ 
drawn, or of low mental ability. Resulting 
interpretations and prescriptions of treatment 
may be invalid and potentially harmful to the 
individual being tested. 


Standard 9.1 

Testing practice should be designed to 
reduce threats to the reliability and validity 
of test score inferences that may arise from 
language differences. 

Comment: Some tests are inappropriate for 
use with individuals whose knowledge of 
the language of the test is questionable. 
Assessment methods together with careful 
professional judgment are required to deter¬ 
mine when language differences are relevant. 
Test users can judge how best to address this 
standard in a particular testing situation. 

Standard 9.2 

When credible research evidence reports 
that test scores differ in meaning across 
subgroups of linguistically diverse test 
takers, then to the extent feasible, test 
developers should collect for each linguistic 
subgroup studied the same form of validity 
evidence collected for the examinee popu¬ 
lation as a whole. 

Comment: Linguistic subgroups may be found 
to differ with respect to appropriateness of 
test content, the internal structure of their 
test responses, the relation of their test scores 
to other variables, or the response processes 
employed by individual examinees. Any such 
findings need to receive due consideration in 
the interpretation and use of scores as well as 
in test revisions. There may also be legal or 
regulatory requirements to collect subgroup 
validity evidence. Not all forms of evidence 
can be examined separately for members of 
all linguistic groups. The validity argument 
may rely on existing research literature, for 
example, and such literature may not be 
available for some populations. For some 
kinds of evidence, separate linguistic sub¬ 
group analyses may not be feasible due to the 
limited number of cases available. Data may 
sometimes be accumulated so that these 
analyses can be performed after the test has 
been in use for a period of time. It is impor¬ 
tant to note that this standard calls for more 
than representativeness in the selection of 
samples used for validation or norming stud¬ 
ies. Rather, it calls for separate, parallel analy¬ 
ses of data for members of different linguistic 
groups, sample sizes permitting. If a test is 
being used while such data are being collect¬ 
ed, then cautionary statements are in order 
regarding the limitations of interpretations. 
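
The idea of separate, parallel analyses, sample sizes permitting, can be illustrated with one kind of evidence. The sketch below computes an internal-consistency estimate (Cronbach's alpha, chosen here only as a familiar example) separately for each linguistic group and reports nothing for groups with too few cases. The function names and the MIN_CASES threshold are assumptions made for illustration; the standard itself sets no minimum sample size.

```python
from statistics import variance

MIN_CASES = 30  # assumed threshold for illustration; not set by the standard


def cronbach_alpha(rows):
    """rows: one list of item scores per examinee (all the same length)."""
    k = len(rows[0])
    # Variance of each item across examinees, and variance of total scores.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


def reliability_by_group(records, min_cases=MIN_CASES):
    """records: iterable of (group_label, item_score_row) pairs.

    Returns {group_label: alpha}, with None where the group has too few
    cases for a separate analysis to be feasible yet.
    """
    groups = {}
    for label, row in records:
        groups.setdefault(label, []).append(row)
    report = {}
    for label, rows in groups.items():
        if len(rows) < min_cases:
            report[label] = None  # accumulate more data before analyzing
        else:
            report[label] = cronbach_alpha(rows)
    return report
```

A None entry corresponds to the situation described above in which data are still being accumulated, so interpretations for that group would carry a cautionary statement.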

Standard 9.3 

When testing an examinee proficient in two 
or more languages for which the test is avail¬ 
able, the examinee’s relative language profi¬ 
ciencies should be determined. The test 
generally should be administered in the test 
taker’s most proficient language, unless pro¬ 
ficiency in the less proficient language is 
part of the assessment. 

Comment: Unless the purpose of the testing 
is to determine proficiency in a particular 
language or the level of language proficiency 
required for the test is a work requirement, 
test users need to take into account the lin¬ 
guistic characteristics of examinees who are 
bilingual or use multiple languages. This may 
require the sole use of one language or use of 
multiple languages in order to minimize the 
introduction of construct-irrelevant compo¬ 
nents to the measurement process. For exam¬ 
ple, in educational settings, testing in both 
the language used in school and the native 
language of the examinee may be necessary 
in order to determine the optimal kind of 
instruction required by the examinee. 
Professional judgement needs to be used to 
determine the most appropriate procedures 
for establishing relative language proficien¬ 
cies. Such procedures may range from self- 
identification by examinees through formal 
proficiency testing. 


Standard 9.4 

Linguistic modifications recommended by 
test publishers, as well as the rationale for 
the modifications, should be described in 
detail in the test manual. 

Comment: Linguistic modifications may be 
recommended for the original test in the pri¬ 
mary language or for an adapted version in a 
secondary language, or both. In any case, the 
test manual should provide appropriate infor¬ 
mation regarding the recommended modifi¬ 
cations, their rationales, and the appropriate 
use of scores obtained using these linguistic 
modifications. 

Standard 9.5 

When there is credible evidence of score 
comparability across regular and modified 
tests or administrations, no flag should be 
attached to a score. When such evidence 
is lacking, specific information about the 
nature of the modification should be 
provided, if permitted by law, to assist 
test users properly to interpret and act 
on test scores. 

Comment: The inclusion of a flag on a test 
score where a linguistic modification was 
provided may conflict with legal and social 
policy goals promoting fairness in the treat¬ 
ment of individuals of diverse linguistic 
backgrounds. If a score from a modified 
administration is comparable to a score from 
a nonmodified administration, there is no 
need for a flag. Similarly, if a modification 
is provided for which there is no reasonable 
basis for believing that the modification 
would affect score comparability, there is no 
need for a flag. Further, reporting practices 
that use asterisks or other non-specific symbols 
to indicate that a test's administration 
has been modified provide little useful infor¬ 
mation to test users. 
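
The reporting logic of this standard and its comment can be summarized in a short sketch. It is an illustration under stated assumptions: the function name score_report_note and its parameters are hypothetical, and each boolean input stands for a judgment that in practice rests on evidence and on applicable law.

```python
from typing import Optional


def score_report_note(modified: bool,
                      comparability_evidence: bool,
                      may_affect_comparability: bool,
                      description: str,
                      permitted_by_law: bool) -> Optional[str]:
    """Return the note to attach to a score report, or None for no flag."""
    if not modified:
        return None  # standard administration: nothing to report
    if comparability_evidence:
        return None  # credible evidence of comparability: no flag
    if not may_affect_comparability:
        return None  # no reasonable basis to expect an effect: no flag
    if not permitted_by_law:
        return None  # disclosure is provided only if permitted by law
    if not description.strip():
        # A bare symbol or empty note gives test users little information.
        raise ValueError("a specific description of the modification is needed")
    return description
```

Returning the description itself, rather than an asterisk or similar marker, reflects the point that non-specific symbols are of little use to test users.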


Standard 9.6 

When a test is recommended for use with 
linguistically diverse test takers, test develop¬ 
ers and publishers should provide the infor¬ 
mation necessary for appropriate test use 
and interpretation. 

Comment: Test developers should include in 
test manuals and in instructions for score 
interpretation explicit statements about the 
applicability of the test with individuals who 
are not native speakers of the original lan¬ 
guage of the test. However, it should be rec¬ 
ognized that test developers and publishers 
seldom will find it feasible to conduct studies 
specific to the large number of linguistic 
groups found in certain countries. 

Standard 9.7 

When a test is translated from one language 
to another, the methods used in establishing 
the adequacy of the translation should be 
described, and empirical and logical evi¬ 
dence should be provided for score reliability 
and the validity of the translated test’s score 
inferences for the uses intended in the lin¬ 
guistic groups to be tested. 

Comment: For example, if a test is translated 
into Spanish for use with Mexican, Puerto 
Rican, Cuban, Central American, and 
Spanish populations, score reliability and the 
validity of test score inferences should be 
established with members of each of these 
groups separately where feasible. In addition, 
the test translation methods used need to be 
described in detail. 

Standard 9.8 

In employment and credentialing testing, 
the proficiency level required in the lan¬ 
guage of the test should not exceed that 
appropriate to the relevant occupation or 
profession. 


Comment: Many occupations and professions 
require a suitable facility in the language of 
the test. In such cases, a test that is used as a 
part of selection, advancement, or credential¬ 
ing may appropriately reflect that aspect of 
performance. However, the level of language 
proficiency required on the test should be no 
greater than the level needed to meet work 
requirements. Similarly, the modality in 
which language proficiency is assessed should 
be comparable to that on the job. For exam¬ 
ple, if the job requires only that employees 
understand verbal instructions in the lan¬ 
guage used on the job, it would be inap¬ 
propriate for a selection test to require 
proficiency in reading and writing that 
particular language. 

Standard 9.9 

When multiple language versions of a test 
are intended to be comparable, test develop¬ 
ers should report evidence of test compara¬ 
bility. 

Comment: Evidence of test comparability may 
include but is not limited to evidence that the 
different language versions measure equiva¬ 
lent or similar constructs, and that score relia¬ 
bility and the validity of inferences from 
scores from the two versions are comparable. 

Standard 9.10 

Inferences about test takers’ general lan¬ 
guage proficiency should be based on tests 
that measure a range of language features, 
and not on a single linguistic skill. 

Comment: For example, a multiple-choice, 
pencil-and-paper test of vocabulary does not 
indicate how well a person understands the 
language when spoken nor how well the per¬ 
son speaks the language. However, the test 
score might be helpful in determining how 
well a person understands some aspects of 
the written language. In making educational 
placement decisions, a more complete
range of communicative abilities (e.g., 
word knowledge, syntax) will typically 
need to be assessed. 

Standard 9.11 

When an interpreter is used in testing, the 
interpreter should be fluent in both the lan¬ 
guage of the test and the examinee’s native 
language, should have expertise in translat¬ 
ing, and should have a basic understanding 
of the assessment process. 

Comment: Although individuals with limited 
proficiency in the language of the test should 
ideally be tested by professionally trained 
bilingual examiners, the use of an interpreter
may be necessary in some situations. If an 
interpreter is required, the professional exam¬ 
iner is responsible for ensuring that the inter¬ 
preter has the appropriate qualifications, 
experience, and preparation to assist appro¬ 
priately in the administration of the test. It is 
necessary for the interpreter to understand 
the importance of following standardized 
procedures, how testing is conducted typically, 
the importance of accurately conveying to the 
examiner an examinee’s actual responses, and
the role and responsibilities of the interpreter 

Background

With the advancement of scientific knowledge, 
medical practices, and social policies, increasing 
numbers of individuals with disabilities are par¬ 
ticipating more fully in educational, employ¬ 
ment, and social activities. This increased 
participation has resulted in a greater need for 
the testing and assessment of individuals with 
disabilities for a variety of purposes. Individuals 
with disabilities are defined as persons pos¬ 
sessing a physical, mental, or developmental 
impairment that substantially limits one or 
more of their major life activities. Although 
the Standards focus on technical and profes¬ 
sional issues regarding the testing of individu¬ 
als with disabilities, test developers and users 
are encouraged to become familiar with federal, 
state, and local laws, and court and adminis¬ 
trative rulings that regulate the testing and 
assessment of individuals with disabilities. 

Tests are administered to individuals with 
disabilities in various settings and for diverse 
purposes. For example, tests are used for diag¬ 
nostic purposes to determine the existence and 
nature of a test taker’s disabilities. Testing is also
conducted for prescriptive purposes to deter¬ 
mine intervention plans. In addition, tests are 
administered to persons who have been diag¬ 
nosed with identified disabilities for educational 
and employment purposes to make placement, 
selection, or other similar decisions, or for 
monitoring performance as a tool for educa¬ 
tional accountability. These uses of tests for 
persons with disabilities occur in a variety of 
contexts including school, clinical, counseling, 
forensic, employment, and credentialing. 

Issues Regarding Accommodation 
When Testing Individuals With 
Disabilities 

A major issue when testing individuals with 
disabilities concerns the use of accommodations,
modifications, or adaptations. The purpose
of these accommodations or modifications
is to minimize the impact of test-taker attributes 
that are not relevant to the construct that is the 
primary focus of the assessment. The terms 
accommodation and modification have varying 
connotations in different subfields. Here 
accommodation is used as the general term for 
any action taken in response to a determination 
that an individual’s disability requires a departure
from established testing protocol. Depending on 
circumstances, such accommodation may include 
modification of test administration processes or 
modification of test content. No connotation 
that modification implies a change in the
construct(s) being measured is intended.

A standardized test that has been designed 
for use with the general population may be 
inappropriate for use for individuals with specific 
disabilities if the test requires the use of sensory, 
motor, language, or psychological skills that are 
affected by the disability and that are not relevant
to the focal construct. For example, a person
who is blind may read only in Braille format, 
and an individual with hemiplegia may be 
unable to hold a pencil and thus would have 
difficulty completing a standard written exam. 
In addition, some individuals with disabilities 
may possess other attendant characteristics 
(e.g., a person with a physical disability may 
fatigue easily), causing them to be further chal¬ 
lenged by some standardized testing situations. 
In these examples, if reading, use of a pencil, 
and fatigue are incidental to the construct 
intended to be measured by the test, modifications
of tests and test administration procedures
may be necessary for an accurate assessment. 

Note also that accommodations are not 
needed or appropriate under a variety of circumstances.
First, the disability may, in fact,
be directly relevant to the focal construct. For 
example, no accommodation is appropriate 
for a person who is completely blind if the 
test is designed to measure visual spatial ability.
Similarly, in employment testing it would be 
inappropriate to make test modifications if the 
test is designed to assess essential skills required 
for the job and the modifications would fun¬ 
damentally alter the constructs being measured. 
Second, an accommodation for a particular 
disability is inappropriate when the purpose of 
a test is to diagnose the presence and degree of 
that disability. For example, allowing extra 
time on a timed test to assess the existence of a
specific learning disability would make it very 
difficult to determine if a processing difficulty 
actually exists. Third, it is important to note 
that not all individuals with disabilities require 
special provisions when taking all tests. Many 
individuals have disabilities that would not 
influence their performance on a particular 
test, and hence no modification is needed. 

Professional judgment necessarily plays a 
substantial role in decisions about test accom¬ 
modations. Judgment comes into play in deter¬ 
mining whether a particular individual needs 
accommodation and the nature and extent of 
such accommodation. In some circumstances, 
individuals with disabilities request testing 
accommodations and provide appropriate doc¬ 
umentation in support of the request. Generally 
the request is reviewed by the agency sponsor¬ 
ing the assessment or an outside source knowl¬ 
edgeable about the assessment process and the 
type of disability. In either case, a conclusion is 
drawn as to what constitutes reasonable accom¬ 
modation. Disagreement may arise between 
the accommodation requested by an individual 
with a disability and the granted accommoda¬ 
tion. In these situations, and to the extent per¬ 
mitted by law, the overarching concern is the 
validity of the inference made from the score 
on the modified test: fairness to all parties is 
best served by a decision about test modifica¬ 
tion that results in the most accurate measure 
possible of the construct of interest. The role 
of professional judgment is further complicat¬ 
ed by the fact that empirical research on test 
accommodations is often lacking. 

When modifying tests it is also important
to recognize that individuals with the same type 
of disability may differ considerably in their need 
for accommodation. A central consideration in 
determining a test modification for a disability
is to recognize that the modifications should be
tailored directly to the specific needs of individual
test takers. As an example, it would be incorrect 
to make the assumption that all individuals with 
visual impairments would be successfully 
accommodated by providing testing materials 
in Braille format. Depending on the extent of 
the disability, it may be more appropriate for 
some individuals to receive testing materials 
written in large print, while others might need 
a tape cassette or reader. 

As test modifications involve altering some
aspect of a test originally developed for use with 
a target population, it is important to recognize 
that making these alterations has the potential 
to affect the psychometric qualities of the test. 
There have been few empirical investigations 
into the effects of various accommodations on 
the reliability of test scores or the validity of 
inferences drawn from modified tests. Due to a 
number of practical limitations (e.g., small 
sample size, nonrandom selection of test takers 
with disabilities), there is no precise, technical 
solution available for equating modified tests to 
the original form of these tests. Thus it is diffi¬ 
cult to compare scores from a test modified for 
persons with disabilities with scores from the 
original test. 

Modifications designed to accommodate 
persons with disabilities also may change the 
construct measured by the test, or the extent 
to which it is fully measured. For example, a 
test of oral comprehension may become a test 
of reading comprehension when administered 
in written format to a person who is deaf or 
hard of hearing. Such a change in test admin¬ 
istration may alter the construct being measured 
by the original test. When this occurs, the scores 
on the standard and modified versions of the 
test will not have the same meaning. Similarly, 
modification of test administration may also 
alter the predictive value of test scores. For
example, when a speed test is administered 
with relaxed time requirements to a person with 
a disability, the relationship of test scores to cri¬ 
teria such as job performance may be affected. 
Appropriate professional judgment should be 
exercised in interpreting and using scores on 
modified tests. 

Some modified tests, with accompanying 
research to support the appropriate modifica¬ 
tions, have been available for a number of years. 
Although the development of tests and testing 
procedures for individuals with disabilities is 
encouraged by the Standards, it should be noted
that all relevant individual standards given else¬ 
where in this document are fully applicable to 
the testing applications and modifications or 
accommodations considered in this chapter. 
Issues of validity and reliability are critical when¬ 
ever modifications or accommodations occur. 

Strategies of Test Modification 

A variety of test modification strategies have 
been implemented in various settings to accommodate
the needs of test takers with disabilities.
Some require modifying test administration 
procedures (e.g., instructions, response format) 
while others alter test medium, timing, set¬ 
tings, or content. Depending on the nature and 
extent of the disability, one or more test modi¬ 
fication procedures may be appropriate for a 
particular individual. The listing here of a vari¬ 
ety of modification strategies should not sug¬ 
gest that the full array of strategies is routinely 
available or appropriate; the decision to modify 
rests on a determination that modification is 
needed to make valid inferences about the individual’s
standing on the construct in question.

Modifying Presentation Format 

One modification option is to alter the 
medium used to present the test instructions 
and items to the test takers. For example, a 
test booklet may be produced in Braille or 
large print for individuals with visual impairments.
When tests are computer-administered,
larger fonts or oversized computer screens may
be used. Individuals with a hearing disability 
may receive test instructions through the use 
of sign communication or writing. 

Modifying Response Format 

Modifications also can be made to allow 
individuals with disabilities to respond to test 
items using their preferred communication 
modality. For example, an individual with severe 
language deficits might be allowed to point to 
the preferred response. A test taker who cannot 
manually record answers to test items or ques¬ 
tions may be assisted by an aide who would mark 
the answer. Other ways of obtaining a response 
include having the respondent use a tape record¬ 
er, a computer keyboard, or a Braillewriter. 

Modifying Timing 

Another modification available is to alter 
the timing of tests. This may include extended 
time to complete the test, more breaks during 
testing, or extended testing sessions over sever¬ 
al days. Many national testing programs (e.g., 
achievement, certification) allow persons with 
disabilities additional time to take the test. 
Reading Braille, using a cassette recorder, or 
having a reader may take longer than reading 
regular print. Reading large type may or may 
not be more time-consuming, depending on 
the layout of the material and on the nature 
and severity of the impairment. 

Modifying Test Setting 

Tests normally administered in group set¬ 
tings may be administered individually for a 
variety of purposes. Individual administration 
may avoid interference with others taking a 
test in a group. Some disabilities (e.g., atten¬ 
tion deficit disorder) make it impractical to 
test in a group setting. Other alterations may 
include changing the testing location if it is 
not wheelchair accessible, providing tables or 
chairs that provide greater physical support, or 
altering the lighting conditions for individuals 
who are visually impaired. 

Using Only Portions of a Test

Another strategy of test accommodation 
involves the use of portions of a test in assess¬ 
ing persons with disabilities. These procedures 
are sometimes used in clinical testing when cer¬ 
tain subparts of a test require physical, sensory, 
language, or other capabilities that a test taker 
with disabilities does not have. This approach 
is commonly used in cognitive and achievement 
testing when the physical or sensory limitations 
of an individual interfere with the ability to per¬ 
form on a test. For example, if a cognitive ability 
test includes items presented orally combined 
with items presented in a written fashion, the 
orally-presented items might be omitted when 
the test is given to an individual with a hearing 
disability as they will not provide an adequate 
assessment of that individual’s cognitive ability.
Results on such items are more likely to reflect
the individual’s hearing difficulty rather than
his or her true cognitive ability. Although 
omitting test items may represent an effective 
accommodation technique, it may also prevent 
the test from adequately measuring the intended
skills or abilities, especially if those skills or
abilities are of central interest. For example, it 
should be noted that eliminating a portion of 
the test may not be appropriate in situations 
such as certification testing and employment 
testing where the construct measured by
each portion may represent a separate and necessary
job or occupational requirement.

Using Substitute Tests or Alternate Assessments 

One additional modification is to 
replace a test standardized on the general 
population with a test or alternate assessment 
that has been specially designed for individu¬ 
als with disabilities. More valid results may 
be obtained through the use of a test specifi¬ 
cally designed for use with individuals with 
disabilities. Although a substitute test may 
represent a desirable accommodation solu¬ 
tion, it may be difficult to find an adequate 
replacement that measures the same con¬ 
struct with comparable technical quality, 
and for which scores can be placed on the
same scale as the original test. 

Using Modifications in Different 
Testing Contexts 

There are important contextual differences 
between the individualized use of tests, as in 
the case of clinical diagnosis, and group or 
large-scale testing, as in the case of testing for 
academic achievement, employment, credentialing,
or admissions.

Individual diagnostic testing is conducted 
typically for clinical or educational purposes. In 
these contexts a highly qualified test professional
(e.g., a licensed or certified psychologist) is
responsible for the entire assessment process of 
test selection, administration, interpretation, and
reporting of results. The test professional seeks to
gather appropriate information about the client’s 
specific disability and preferred modality of 
communication and uses this information to 
determine the accommodations appropriate for 
the test taker. During the assessment process, 
any modified tests are used along with other 
assessment methods to collect data about the 
client’s functioning in relevant areas. Inferences 
are then made based on this multitude of infor¬ 
mation. Test modifications may be used during 
assessment not only out of necessity but also as a 
source of clinical insight about the client’s func¬ 
tioning. For example, a test taker with obsessive 
compulsive disorder may be allowed to continue 
to complete a test item, subtest, or a total test 
beyond the standardized time limits. Although 
in such cases the performance of the test taker 
cannot be judged according to the standardized 
scoring standards, the fact that the test taker 
could produce a successful performance with 
extra time often aids clinical interpretation. 

The use of test modifications in large-scale 
testing is different, however. Large-scale testing 
is used for purposes such as measurement of 
academic achievement, program evaluation, 
credentialing, licensure, and employment. In 
these contexts, a standardized test usually is 
administered to all test participants. Large
numbers of test takers are not uncommon, and 
decisions may in some cases be made solely on 
the basis of test information, as in the case of 
a test used as an initial screening device in an 
employment context. In some cases, decision 
making requires the comparison of test takers, 
as in selection or admission contexts where the 
number of applicants may greatly exceed the 
number of available openings. This context 
highlights the need for concern for fairness to 
all parties, as comparisons must be made be¬ 
tween test scores obtained by individuals with 
disabilities taking modified tests and scores 
obtained by individuals under regular condi¬ 
tions. While test takers should not be disad¬ 
vantaged due to a disability not relevant to the 
construct the test is intended to assess, the 
resulting accommodation should not put those 
taking a modified test at an undue advantage 
over those tested under regular conditions. As 
research on the comparability of scores under 
regular and modified conditions is sometimes 
limited, decisions about appropriate accommo¬ 
dation in these contexts involve important and 
difficult professional judgments. 

Reporting Scores on Modified Tests 

The practice of reporting scores on modified 
tests varies in different contexts. In individual 
testing, the test professional commonly re¬ 
ports when tests have been administered in a 
nonstandardized fashion when providing test 
scores. Typically, the steps used in making test 
accommodations or modifications are described 
in the test report, and the validity of the infer¬ 
ences resulting from the modified test scores is 
discussed. This practice of reporting the nature 
of modifications is consistent with implied re¬ 
quirements to communicate information as to 
the nature of the assessment process if the mod¬ 
ifications impact the reliability of test scores or 
the validity of inferences drawn from test scores. 

On the other hand, the reporting of test 
scores from modified tests in large-scale testing
has created considerable debate. Often
when scores from a nonstandardized version 
of a test are reported, the score report con¬ 
tains an asterisk next to the score or some 
other designation, often called a flag, to indi¬ 
cate that the test administration was modi¬ 
fied. Sometimes recipients of these special 
designations are informed of the meaning of 
the designation; many times no information 
is provided about the nature of the modifica¬ 
tion made. Some argue that reporting scores 
from nonstandard test administrations with¬ 
out special identification misleads test users 
and perhaps even harms test takers with dis¬ 
abilities, whose scores may not accurately 
reflect their abilities. Others, however, argue 
that identifying scores of test takers with dis¬ 
abilities as resulting from nonstandard admin¬ 
istrations unfairly labels these test takers as 
persons with disabilities, stigmatizes them, 
and may deny them the opportunity to com¬ 
pete equally with test takers without disabili¬ 
ties when they might otherwise be able to do 
so. Federal laws and the laws of most states bar 
discrimination against persons with disabili¬ 
ties, require individualized reasonable accom¬ 
modations in testing, and limit practices that 
could stigmatize persons with disabilities,
particularly in educational, admissions, credentialing,
and employment testing.

The fundamental principles relevant 
here are that important information about 
test score meaning should not be withheld 
from test users who interpret and act on the 
test scores, and that irrelevant information 
should not be provided. When there is suf¬ 
ficient evidence of score comparability 
across regular and modified administrations, 
there is no need for any sort of flagging. 
When such evidence is lacking, an undiffer¬ 
entiated flag provides only very limited 
information to the test user, and specific 
information about the nature of the modifi¬ 
cation is preferable, if permitted by law. 
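
The reporting principles above can be read as a small decision rule. The following is only an illustrative sketch, not part of the Standards: the names `ScoreReport` and `build_report` and the wording of the note are hypothetical, and the handling of the case where disclosure is not permitted by law (report the score alone) is an assumption, since the text does not spell that case out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScoreReport:
    score: float
    # None means the score is reported with no annotation of any kind.
    note: Optional[str] = None


def build_report(score, modified, comparable, disclosure_permitted, modification=""):
    """Hypothetical helper applying the reporting principles described above."""
    # Regular administration, or sufficient evidence of comparability: no flag.
    if not modified or comparable:
        return ScoreReport(score)
    # Comparability evidence is lacking: describe the modification itself
    # (never the disability), and only where the law permits disclosure.
    if disclosure_permitted and modification:
        return ScoreReport(score, note=f"Modified administration: {modification}")
    # Assumption: when disclosure is not permitted, report the score alone.
    return ScoreReport(score)
```

Note that the sketch never emits an undifferentiated flag such as an asterisk: a score is either reported plainly or accompanied by a specific description of the modification.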

Standard 10.1

In testing individuals with disabilities, test 
developers, test administrators, and test 
users should take steps to ensure that the 
test score inferences accurately reflect the 
intended construct rather than any disabili¬ 
ties and their associated characteristics extra¬ 
neous to the intent of the measurement. 

Comment: Chapter 1 (Validity) deals more 
broadly with the critical requirement that a test 
score reflects the intended construct. The need 
to attend to the possibility of construct-irrelevant
variance resulting from a test taker’s disability
is an example of this general principle.
In some settings, test users are prohibited from 
inquiring about a test taker’s disability, making
the standard contingent on test taker self-report 
of a disability or a need for accommodation. 

Standard 10.2 

People who make decisions about accommodations
and test modification for individuals
with disabilities should be knowledgeable of 
existing research on the effects of the disabil¬ 
ities in question on test performance. Those 
who modify tests should also have access to 
psychometric expertise for so doing. 

Comment: In some areas there may be little 
known about the effects of a particular disabil¬ 
ity on performance on a particular type of test. 

Standard 10.3 

Where feasible, tests that have been modified 
for use with individuals with disabilities 
should be pilot tested on individuals who have 
similar disabilities to investigate the appropri¬ 
ateness and feasibility of the modifications. 

Comment: Although useful guides for modify¬ 
ing tests are available, they do not provide a 
universal substitute for trying out a modified
test. Even when such tryouts are conducted
on samples inadequate to produce norm data,
they are useful for checking the mechanics of 
the modifications. In many circumstances, 
however, lack of ready access to individuals 
with similar disabilities, or an inability to post¬ 
pone decision making, make this unfeasible. 

Standard 10.4 

If modifications are made or recommended
by test developers for test takers with specific 
disabilities, the modifications as well as the 
rationale for the modifications should be 
described in detail in the test manual and 
evidence of validity should be provided 
whenever available. Unless evidence of validi¬ 
ty for a given inference has been established 
for individuals with the specific disabilities, 
test developers should issue cautionary state¬ 
ments in manuals or supplementary materi¬ 
als regarding confidence in interpretations 
based on such test scores. 

Comment: When test developers and users 
intend that a modified version of a test should 
be interpreted as comparable to an unmodified 
one, evidence of test score comparability 
should be provided. 

Standard 10.5 

Technical material and manuals that accom¬ 
pany modified tests should include a careful 
statement of the steps taken to modify the 
tests to alert users to changes that are likely 
to alter the validity of inferences drawn from 
the test score. 

Comment: If empirical evidence of the 
nature and effects of changes resulting from 
modifying standard tests is lacking, it is 
impossible to assess the impact of significant 
modifications. Documentation of the proce¬ 
dures used to modify tests will not only aid 
in the administration and interpretation of 
the given test but will also inform others 
who are modifying tests for people with specific
disabilities. This standard should apply
to both test developers and test users. 

Standard 10.6 

If a test developer recommends specific time 
limits for people with disabilities, empirical 
procedures should be used, whenever possible, 
to establish time limits for modified forms of 
timed tests rather than simply allowing test 
takers with disabilities a multiple of the stan¬ 
dard time. When possible, fatigue should be 
investigated as a potentially important factor 
when time limits are extended. 

Comment: Such empirical evidence is likely 
only in the limited settings where a sufficient 
number of individuals with similar disabilities 
are tested. Not all individuals with the same 
disability, however, necessarily require the same 
accommodation. In most cases, professional 
judgment based on available evidence regarding 
the appropriate time limits given the nature of 
an individual’s disability will be the basis for
decisions. Legal requirements may be relevant 
to any decision on absolute time limits. 

Standard 10.7 

When sample sizes permit, the validity of 
inferences made from test scores and the 
reliability of scores on tests administered to 
individuals with various disabilities should 
be investigated and reported by the agency 
or publisher that makes the modification. 
Such investigations should examine the 
effects of modifications made for people 
with various disabilities on resulting scores, 
as well as the effects of administering stan¬ 
dard unmodified tests to them. 

Comment: In addition to modifying tests 
and test administration procedures for people 
who have disabilities, evidence of validity for 
inferences drawn from these tests is needed. 
Validation is the only way to amass knowledge
about the usefulness of modified tests
for people with disabilities. The costs of
obtaining validity evidence should be consid¬ 
ered in light of the consequences of not having 
usable information regarding the meanings 
of scores for people with disabilities. This 
standard is feasible in the limited circum¬ 
stances where a sufficient number of individ¬ 
uals with the same level or degree of a given 
disability is available. 

Standard 10.8 

Those responsible for decisions about test 
use with potential test takers who may need 
or may request specific accommodations 
should (a) possess the information necessary 
to make an appropriate selection of meas¬ 
ures, (b) have current information regarding 
the availability of modified forms of the test 
in question, (c) inform individuals, when 
appropriate, about the existence of modified 
forms, and (d) make these forms available to 
test takers when appropriate and feasible. 

Standard 10.9 

When relying on norms as a basis for score 
interpretation in assessing individuals with 
disabilities, the norm group used depends 
upon the purpose of testing. Regular norms 
are appropriate when the purpose involves 
the test taker’s functioning relative to the
general population. If available, normative 
data from the population of individuals with 
the same level or degree of disability should 
be used when the test taker’s functioning relative
to individuals with similar disabilities is at issue.


Standard 10.10 

Any test modifications adopted should be 
appropriate for the individual test taker, 
while maintaining all feasible standardized 
features. A test professional needs to consid¬ 
er reasonably available information about 
each test taker’s experiences, characteristics, 
and capabilities that might impact test per¬
formance, and document the grounds for 
the modification. 

Standard 10.11 

When there is credible evidence of score com¬ 
parability across regular and modified admin¬ 
istrations, no flag should be attached to a 
score. When such evidence is lacking, specific 
information about the nature of the modifica¬ 
tion should be provided, if permitted by law, 
to assist test users properly to interpret and 
act on test scores. 

Comment: The inclusion of a flag on a test 
score where an accommodation for a disability 
was provided may conflict with legal and social 
policy goals promoting fairness in the treat¬ 
ment of individuals with disabilities. If a score 
from a modified administration is comparable 
to a score from a nonmodified administration, 
there is no need for a flag. Similarly, if a modi¬ 
fication is provided for which there is no rea¬ 
sonable basis for believing that the modification 
would affect score comparability, there is no 
need for a flag. Further, reporting practices that 
use asterisks or other nonspecific symbols to 
indicate that a test’s administration has been 
modified provide little useful information to 
test users. When permitted by law, if a non- 
standardized administration is to be reported 
because evidence does not exist to support 
score comparability, then this report should 
avoid referencing the existence or nature of the 
test taker’s disability and should instead report 
only the nature of the accommodation provid¬ 
ed, such as extended time for testing, the use 
of a reader, or the use of a tape recorder. 

Standard 10.12 

In testing individuals with disabilities for 
diagnostic and intervention purposes, the 
test should not be used as the sole indicator 
of the test taker’s functioning. Instead, multi¬ 
ple sources of information should be used. 


Comment: For example, when assessing the 
intellectual functioning of persons with men¬ 
tal retardation, results from an individually 
administered intelligence test are generally 
supplemented with other pertinent informa¬ 
tion, such as case history, information about 
school functioning, and results from other cog¬ 
nitive tests and adaptive behavior measures. In 
addition, at times a multidisciplinary evalua¬ 
tion (e.g., physical, psychological, linguistic, 
neurological, etc.) may be needed to yield an 
accurate picture of the person’s functioning. 

11. THE RESPONSIBILITIES OF
TEST USERS 


Background 

Previous chapters have dealt primarily with the 
responsibilities of those who develop, market, 
evaluate, or mandate the administration of 
tests and the rights and obligations of test takers.
Many of the standards in these chapters,
and in the chapters that follow, refer to the 
development of tests and their use in specific 
settings. The present chapter includes standards 
of a more general nature that apply in almost 
all measurement contexts. In particular, atten¬ 
tion is centered on the responsibilities of those 
who may be considered the users of tests. This 
group includes psychologists, educators, and 
other professionals who select the specific 
instruments or supervise test administration— 
on their own authority or at the behest of oth¬ 
ers. It also includes all individuals who actively 
participate in the interpretation and use of test 
results, other than the test takers themselves. 

It is presumed that a legitimate educational, 
psychological, or employment purpose justifies 
the time and expense of test administration. In 
most settings, the user communicates this pur¬ 
pose to those who have a legitimate interest in 
the measurement process and subsequently 
conveys the implications of examinee performance
to those entitled to receive the information.
Depending on the measurement setting, this 
group may include individual test takers, par¬ 
ents and guardians, educators, employers, policy¬ 
makers, the courts, or the general public. 

Where administration of tests or use of test
data is mandated for a specific population by 
governmental authorities, educational insti¬ 
tutions, licensing boards, or employers, the 
developer and user of an instrument may be 
essentially the same. In such settings, there 
often is no clear separation between the pro¬ 
fessional responsibilities of those who produce 
the instrument and those who administer the
test and interpret the results. Instruments produced
by independent publishers, on the other
hand, present a somewhat different picture. 
Typically, these tests will be used with a vari¬ 
ety of populations and for diverse purposes. 

The conscientious developer of a standard¬ 
ized test attempts to screen and educate poten¬ 
tial users. Furthermore, most publishers and 
test sponsors work vigorously to prevent the
misuse of standardized measures and the mis¬ 
interpretation of individual scores and group 
averages. Test manuals often illustrate sound 
and unsound interpretations and applications. 
Some identify specific practices that are not 
appropriate and should be discouraged. Despite 
the best efforts of test developers, however, 
appropriate test use and sound interpretation 
of test scores are likely to remain primarily 
the responsibility of the test user. 

Test takers, parents and guardians, legislators,
policymakers, the media, the courts, and
the public at large often yearn for unambiguous 
interpretations of test data. In particular, they 
often tend to attribute positive or negative 
results, including group differences, to a single 
factor or to the conditions that prevail in one 
social institution—most often, the home or 
the school. These consumers of test data fre¬ 
quently press for explicit rationales for decisions 
that are based only in part on test scores. The
wise test user helps all interested parties understand
that sound decisions regarding test use
and score interpretation involve an element of 
professional judgment. It is not always obvi¬ 
ous to the consumers that the choice of vari¬ 
ous information-gathering procedures often 
involves experience that is not easily quantified 
or verbalized. The user can help them appreci¬ 
ate the fact that the weighting of quantitative 
data, educational and occupational infor¬ 
mation, behavioral observations, anecdotal 
reports, and other relevant data often cannot
be specified precisely. 




Because of the appearance of objectivity 
and numerical precision, test data are some¬ 
times allowed to totally override other sources 
of evidence about test takers. There are circumstances
in which selection based exclusively on
test scores may be appropriate. For example, this 
may be the case in pre-employment screening. 
But in educational and psychological settings, 
test users are well advised, and may be legally 
required, to consider other relevant sources of 
information on test takers, not just test scores. 
In the latter situations, the psychologist or 
educator familiar with the local setting and
with local test takers is best qualified to inte¬ 
grate this diverse information effectively. 

As reliance on test results has grown in 
recent years, greater pressure has been placed 
on test users to explain to the public the ration¬ 
ale for test-based decisions. More than ever 
before, test users are called upon to defend 
their testing practices. They do this by docu¬ 
menting that their test uses and score inter¬ 
pretations are supported by measurement 
authorities for the given purpose, that the inferences
drawn from their instruments are validated
for use with a given population, and that the
results are being used in conjunction with other 
information, not in isolation. If these condi¬ 
tions are met, the test user can convincingly 
defend the decisions made or the administrative 
actions taken in which tests played a part. 

It is not appropriate for these Standards to 
dictate minimal levels of test-criterion correla¬ 
tion, classification accuracy, or reliability for 
any given purpose. Such levels depend on 
whether decisions must be made immediately 
on the strength of the best available evidence, 
however weak, or whether decisions can be 
delayed until better evidence becomes avail¬ 
able. But it is appropriate to expect the user to 
ascertain what the alternatives are, what the 
quality and consequences of these alternatives 
are, and whether a delay in decision making 
would be beneficial. Cost-benefit compromises 
become necessary in test use, as they often are 
in test development. It should be noted, however,
that in some contexts legal requirements
may place limits on the extent to which such 
compromises can be made. As with standards 
for the various phases of test development, 
when relevant standards are not met in test 
use, the reasons should be persuasive. The 
greater the potential impact on test takers, for 
good or ill, the greater the need to identify and 
satisfy the relevant standards. 

In selecting a test and interpreting a test 
score, the test user is expected to have a clear 
understanding of the purposes of the testing 
and its probable consequences. The knowl¬ 
edgeable user has definite ideas on how to 
achieve these purposes and how to avoid bias, 
unfairness, and undesirable consequences. In 
subscribing to these Standards, test publishers
and agencies mandating test use agree to pro¬ 
vide information on the strengths and weak¬ 
nesses of their instruments. They accept the 
responsibility to warn against likely misinter¬ 
pretations by unsophisticated interpreters of 
individual scores or aggregated data. However, 
the ultimate responsibility for appropriate test 
use and interpretation lies predominantly with 
the test user. In assuming this responsibility, 
the user must become knowledgeable about a 
test’s appropriate uses and the populations for 
which it is suitable. The user must also become 
adept, particularly in statewide and communi¬ 
ty-wide assessment programs, in communicat¬ 
ing the implications of test results to those 
entitled to receive them. 

In some instances, users may be obli¬ 
gated to collect additional evidence about a 
test’s technical quality. For example, if per¬ 
formance assessments are locally scored, evi¬ 
dence of the degree of inter-scorer agreement 
may be required. Users also should be alert 
to the probable local consequences of test 
use, particularly in the case of large-scale 
testing programs. If the same test material
is used in successive years, users should
actively monitor the program to ensure that
reuse has not compromised the integrity of 
the results. 
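
As an illustration of the inter-scorer agreement evidence mentioned above, the following minimal sketch computes two commonly used indices for a pair of local scorers: simple percent agreement and Cohen's kappa, which adjusts for agreement expected by chance. The choice of these indices, the function names, and the sample scores are assumptions made for illustration; these Standards do not prescribe a particular statistic or threshold.

```python
from collections import Counter


def percent_agreement(scores_a, scores_b):
    """Share of responses on which two scorers assigned the same score."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    matches = sum(a == b for a, b in zip(scores_a, scores_b))
    return matches / len(scores_a)


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(scores_a)
    observed = percent_agreement(scores_a, scores_b)
    counts_a = Counter(scores_a)
    counts_b = Counter(scores_b)
    # Chance agreement: product of each scorer's marginal proportions,
    # summed over score categories.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both scorers used a single identical category
    return (observed - expected) / (1 - expected)


# Hypothetical rubric scores (1 to 4) given by two scorers to eight responses.
scorer_1 = [3, 2, 4, 4, 1, 3, 2, 4]
scorer_2 = [3, 2, 4, 3, 1, 3, 2, 2]

agreement = percent_agreement(scorer_1, scorer_2)
kappa = cohens_kappa(scorer_1, scorer_2)
print(f"percent agreement = {agreement:.2f}, kappa = {kappa:.2f}")
```

Whether a given level of agreement is adequate depends on the purpose of testing and the stakes involved, so any cutoff applied to these values would be a local judgment.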

Some of the standards that follow reiterate
ideas contained in other chapters, principally
chapter 5 “Test Administration, Scoring, and
Reporting,” chapter 7 “Fairness in Testing and
Test Use,” chapter 8 “Rights and Responsibilities
of Test Takers,” and chapter 13 “Educational
Testing and Assessment.” This repetition
is intentional. It permits an enumeration in 
one chapter of the major obligations that must 
be assumed largely by the test administrator 
and user, though these responsibilities may 
refer to topics that are covered more fully in 
other chapters. 


Standard 11.1 

Prior to the adoption and use of a published 
test, the test user should study and evaluate 
the materials provided by the test developer. 
Of particular importance are those that 
summarize the test’s purposes, specify the 
procedures for test administration, define 
the intended populations of test takers, and 
discuss the score interpretations for which 
validity and reliability data are available. 

Comment: A prerequisite to sound test use is 
knowledge of the materials accompanying the 
instrument. As a minimum, these include man¬ 
uals provided by the test developer. Ideally, the 
user should be conversant with relevant studies 
reported in the professional literature. The 
degree of reliability and validity required for
sound score interpretations depends on the 
test’s role in the assessment process and the 
potential impact of the process on the people 
involved. The test user should be aware of 
legal restrictions that may constrain the use of 
the test. On occasion, professional judgment 
may lead to the use of instruments for which 
there is little documentation of validity for the 
intended purpose. In these situations, the user 
should interpret scores cautiously and take care 
not to imply that the decisions or inferences are 
based on test results that are well-documented 
with respect to reliability or validity. 

Standard 11.2 

When a test is to be used for a purpose for 
which little or no documentation is avail¬ 
able, the user is responsible for obtaining 
evidence of the test’s validity and reliability 
for this purpose. 

Comment: The individual who uses test scores 
for purposes that are not specifically recom¬ 
mended by the test developer is responsible 
for collecting the necessary validity evidence. 
Support for such uses may sometimes be found 
in the professional literature. If previous evidence 
is not sufficient, then additional data should be
collected. The provisions of this standard should
not be construed to prohibit the generation of 
hypotheses from test data. For example, though 
some clinical tests have limited or contradic¬ 
tory validity evidence for common uses, clini¬ 
cians generate hypotheses based appropriately 
on examinee responses to such tests. However, 
these hypotheses should be clearly labeled as
tentative. Interested parties should be made 
aware of the potential limitations of the test 
scores in such situations. 

Standard 11.3 

Responsibility for test use should be assumed 
by or delegated only to those individuals who 
have the training, professional credentials, 
and experience necessary to handle this 
responsibility. Any special qualifications for 
test administration or interpretation specified 
in the test manual should be met. 

Comment: Test users should not attempt to 
interpret the scores of test takers whose special
needs or characteristics are outside the range of 
the user’s qualifications. This standard has spe¬ 
cial significance in areas such as clinical testing, 
forensic testing, testing in special education,
testing people with disabilities or limited expo¬ 
sure to the dominant culture, and in other such 
situations where potential impact is great. 
When the situation falls outside the user’s expe¬ 
rience, assistance should be obtained. A num¬ 
ber of professional organizations have codes of 
ethics that specify the qualifications of those 
who administer tests and interpret scores. 

Standard 11.4 

The test user should have a clear rationale 
for the intended uses of a test or evaluation 
procedure in terms of its validity and con¬ 
tribution to the assessment and decision¬ 
making process. 

Comment: Justification for the role of each 
instrument in selection, diagnosis, classification,
and decision making should be arrived
at before test administration, not afterwards.
Preferably, the rationale should be available in 
printed materials prepared by the test pub¬ 
lisher or by the user. 

Standard 11.5 

Those who have a legitimate interest in an 
assessment should be informed about the 
purposes of testing, how tests will be admin¬ 
istered, the factors considered in scoring 
examinee responses, how the scores are typi¬ 
cally used, how long the records will be 
retained, and to whom and under what con¬ 
ditions the records may be released. 

Comment: This standard has greater relevance 
and application to educational and clinical test¬ 
ing than to employment testing. In most uses 
of tests for screening job applicants and appli¬ 
cants to educational programs, for licensing 
professionals and awarding credentials, or for 
measuring achievement, the purposes of testing 
and the uses to be made of the test scores are 
obvious to the examinee. Nevertheless, it is wise 
to communicate this information at least briefly 
even in these settings. In some situations, however,
the rationale for the testing may be clear
to relatively few test takers. In such settings, a 
more detailed and explicit discussion may be 
called for. Retention and release of records, 
even when such release would clearly benefit 
the examinee, are often governed by statutes 
or institutional practices. As relevant, exam¬ 
inees should be informed about these con¬ 
straints and procedures. 

Standard 11.6 

Unless the circumstances clearly require 
that the test results be withheld, the test 
user is obligated to provide a timely report 
of the results that is understandable to the 
test taker and others entitled to receive 
this information. 

Comment: The nature of score reports is often 
dictated by practical considerations. In some 
cases only a terse printed report may be feasible.
In others, it may be desirable to provide
both an oral and a written report. The inter¬ 
pretation should vary according to the level 
of sophistication of the recipient. When the 
examinee is a young child, an explanation of 
the test results is typically provided to parents 
or guardians. Feedback in the form of a score 
report or interpretation is not typically pro¬ 
vided when tests are administered for person¬ 
nel selection or promotion. 

Standard 11.7 

Test users have the responsibility to protect 
the security of tests, to the extent that devel¬ 
opers enjoin users to do so. 

Comment: When tests are used for purposes of 
selection, licensure, or educational accountabili¬ 
ty, the need for rigorous protection of test 
security is obvious. On the other hand, when 
educational tests are not part of a high-stakes 
program, some publishers consider teacher 
review of test materials to be a legitimate tool
in clarifying teacher perceptions of the skills 
measured by a test. Consistency and clarity in 
the definition of acceptable and unacceptable 
practices is critical in such situations. When 
tests are involved in litigation, inspection of 
the instruments should be restricted—to the 
extent permitted by law—to those who are legal¬ 
ly or ethically obligated to safeguard test security. 

Standard 11.8 

Test users have the responsibility to respect 
test copyrights. 

Comment: Legally and ethically, test users may 
not reproduce copyrighted materials for rou¬ 
tine test use without consent of the copyright 
holder. These materials—in both paper and 
electronic form—include test items, ancillary 
forms such as answer sheets or profile forms, 
scoring templates, conversion tables of raw 
scores to derived scores, and tables of norms. 


Standard 11.9 

Test users should remind test takers and 
others who have access to test materials that 
the legal rights of test publishers, including 
copyrights, and the legal obligations of other 
participants in the testing process may pro¬ 
hibit the disclosure of test items without 
specific authorization. 

Standard 11.10 

Test users should be alert to the possibility 
of scoring errors; they should arrange for 
rescoring if individual scores or aggregated 
data suggest the need for it. 

Comment: The costs of scoring error are great, 
particularly in high-stakes testing programs. 
In some cases, rescoring may be requested by 
the test taker. If such a test taker right is rec¬ 
ognized in published materials, it should be 
respected. In educational testing programs, 
users should not depend entirely on test tak¬ 
ers to alert them to the possibility of scoring 
errors. Monitoring scoring accuracy should 
be a routine responsibility of testing program 
administrators wherever feasible. 

Standard 11.11 

If the integrity of a test taker’s scores is 
challenged, local authorities, the test devel¬ 
oper, or the test sponsor should inform the 
test takers of their relevant rights, including 
the possibility of appeal and representation 
by counsel. 

Comment: Proctors in entrance or licensure 
testing programs may report irregularities 
in the test process that result in challenges. 
University admissions officers may raise chal¬ 
lenges when test scores are grossly inconsis¬ 
tent with other applicant information. Test 
takers should be apprised of their rights in 
such situations. 

Standard 11.12

Test users or the sponsoring agency should 
explain to test takers their opportunities, if 
any, to retake an examination; users should 
also indicate whether the earlier as well as 
later scores will be reported to those entitled
to receive the score reports. 

Comment: Some testing programs permit test 
takers to retake an examination several times, 
to cancel scores, or to have scores withheld 
from potential recipients. If test takers have 
such privileges, they and score recipients 
should be so informed. 

Standard 11.13 

When test-taking strategies that are unrelat¬ 
ed to the domain being measured are found 
to enhance or adversely affect test perform¬ 
ance significantly, these strategies and their 
implications should be explained to all test 
takers before the test is administered. This 
may be done either in an information booklet 
or, if the explanation can be made briefly, 
along with the test directions. 

Comment: Test-taking strategies, such as 
guessing, skipping time-consuming items, or 
initially skipping and then returning to diffi¬ 
cult items as time allows, can influence test 
scores positively or negatively. The effects of 
various strategies depend on the scoring sys¬ 
tem used and aspects of item and test design 
such as speededness or the number of 
response alternatives provided in multiple- 
choice items. Differential use of such strate¬ 
gies by test takers can affect the validity and 
reliability of test score interpretations. The 
goal of test directions should be to convey 
information on the possible effectiveness of 
various strategies and, thus, to provide all test 
takers an equal opportunity to perform opti¬ 
mally. The use of such strategies by all test 
takers should be encouraged if their effect 
facilitates performance and discouraged if 
their effect interferes with performance. 
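
To illustrate how the effect of a strategy such as guessing depends on the scoring system, the sketch below compares the expected gain from a blind guess under number-right scoring and under a classical correction-for-guessing rule that deducts 1/(k - 1) of a point for each wrong answer on a k-option item. The penalty rule, the function name, and the four-option item are assumptions chosen for illustration, not a description of any particular testing program.

```python
from fractions import Fraction


def expected_guess_gain(options_remaining, penalty_per_wrong):
    """Expected score change from guessing at random among the options the
    test taker cannot rule out, compared with omitting the item (score 0)."""
    p_right = Fraction(1, options_remaining)
    return p_right * 1 - (1 - p_right) * penalty_per_wrong


# Hypothetical four-option multiple-choice item.
number_right = expected_guess_gain(4, 0)              # no penalty for errors
formula_blind = expected_guess_gain(4, Fraction(1, 3))  # penalty of 1/(k - 1)
formula_informed = expected_guess_gain(2, Fraction(1, 3))  # two options ruled out

print(number_right)      # 1/4: guessing always helps on average
print(formula_blind)     # 0: blind guessing neither helps nor hurts on average
print(formula_informed)  # 1/3: guessing after eliminating options still helps
```

Under these assumptions, test directions that do not explain the scoring rule would leave test takers who omit items at a disadvantage relative to those who guess, which is the kind of differential strategy use the comment describes.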

Standard 11.14

Test users are obligated to protect the privacy 
of examinees and institutions that are 
involved in a measurement program, unless 
a disclosure of private information is agreed 
upon, or is specifically authorized by law. 

Comment: Protection of the privacy of individ¬ 
ual examinees is a well-established principle in 
psychological and educational measurement. 
In some instances, test takers and test administrators
may formally agree to a lesser degree
of protection than the law appears to require. 
In other circumstances, test users and testing 
agencies may adopt more stringent restric¬ 
tions on the communication and sharing of 
test results than relevant law dictates. The 
more rigorous standards sometimes arise 
through the codes of ethics adopted by rele¬ 
vant professional organizations. In some test¬ 
ing programs the conditions for disclosure are 
stated to the examinee prior to testing, and 
taking the test can constitute agreement for 
the disclosure of test score information as 
specified. In other programs, the test taker or 
his/her parents or guardians must formally 
agree to any disclosure of test information to 
individuals or agencies other than those specified
in the test administrator’s published literature.
It should be noted that the right of the
public and the media to examine the aggre¬ 
gate test results of public school systems is 
guaranteed in some states. 

Standard 11.15 

Test users should be alert to potential misin¬ 
terpretations of test scores and to possible 
unintended consequences of test use; users 
should take steps to minimize or avoid fore¬ 
seeable misinterpretations and unintended 
negative consequences. 

Comment: Well-meaning, but unsophisticated, 
audiences may adopt simplistic interpreta¬ 
tions of test results or may attribute high or 
low scores or averages to a single causal factor. 
Experienced test users can sometimes anticipate
such misinterpretations and should try
to prevent them. Obviously, not every unin¬ 
tended consequence can be anticipated. What 
is required is a reasonable effort to prevent 
negative consequences and to encourage 
sound interpretations. 

Standard 11.16 

Test users should verify periodically that 
their interpretations of test data continue to 
be appropriate, given any significant changes 
in their population of test takers, their 
modes of test administration, and their 
purposes in testing. 

Comment: Over time, a gradual change in the
demographic characteristics of an examinee 
population may significantly affect the infer¬ 
ences drawn from group averages. The 
accommodations made in test administration 
in recognition of examinee disabilities or in 
response to unforeseen circumstances may 
also affect interpretations. 

Standard 11.17 

In situations where the public is entitled to 
receive a summary of test results, test users 
should formulate a policy regarding timely 
release of the results and apply that policy 
consistently over time. 

Comment: In school testing programs, dis¬ 
tricts commonly viewed as a coherent group 
may avoid controversy by adopting the same 
policies regarding the release of test results. If 
one district routinely releases aggregated data 
in much greater detail than another, ground¬ 
less suspicions can develop that information 
is being suppressed in the latter district. 

Standard 11.18 

When test results are released to the public 
or to policymakers, those responsible for 
the release should provide and explain any
supplemental information that will minimize
possible misinterpretations of the data. 

Comment: Preliminary briefings prior to the 
release of test results can give reporters for the 
news media an opportunity to assimilate rele¬ 
vant data. Misinterpretation can often be the 
result of the limited time reporters have to 
prepare media reports or inadequate presenta¬ 
tion of information that bears on test score 
interpretation. It should be recognized, how¬ 
ever, that the interests of the media are not 
always consistent with the intended purposes 
of measurement programs. 

Standard 11.19 

When a test user contemplates an approved 
change in test format, mode of administra¬ 
tion, instructions, or the language used in 
administering the test, the user should have 
a sound rationale for concluding that validi¬ 
ty, reliability, and appropriateness of norms 
will not be compromised. 

Comment: In some instances, minor changes 
in format or mode of administration may be 
reasonably expected, without evidence, to 
have little or no effect on validity, reliability, 
and appropriateness of norms. In other 
instances, however, changes in format or
administrative procedures can be assumed 
a priori to have significant effects. When a 
given modification becomes widespread, con¬ 
sideration should be given to validation and 
norming under the modified conditions. 

Standard 11.20 

In educational, clinical, and counseling 
settings, a test taker’s score should not be 
interpreted in isolation; collateral informa¬ 
tion that may lead to alternative explana¬ 
tions for the examinee’s test performance 
should be considered. 

Comment: It is neither necessary nor feasible to 
make an intensive review of every test taker’s 
score. In some settings there may be little or
no collateral information of value. In counsel¬ 
ing, clinical, and educational settings, however, 
considerable relevant information is likely to 
be available. Obvious alternative explanations 
of low scores include low motivation, limited 
fluency in the language of the test, unfamiliar¬ 
ity with cultural concepts on which test items 
are based, and perceptual or motor impair¬ 
ments. In clinical and counseling settings, the 
test user should not ignore how well the test 
taker is functioning in daily life. 

Standard 11.21 

Test users should not rely on computer-gen¬ 
erated interpretations of test results unless 
they have the expertise to consider the 
appropriateness of these interpretations in 
individual cases. 

Comment: The scoring agency has the respon¬ 
sibility of documenting the basis for the 
interpretations. The user of a computerized 
scoring and reporting service has the obliga¬ 
tion to be familiar with the principles on 
which such interpretations were derived. 
The user should have the ability to evaluate 
a computer-based score interpretation in the 
light of other relevant evidence on each test 
taker. Automated, narrative reports are not a 
substitute for sound professional judgment. 

Standard 11.22

When circumstances require that a test be 
administered in the same language to all 
examinees in a linguistically diverse popula¬ 
tion, the test user should investigate the 
validity of the score interpretations for test 
takers believed to have limited proficiency 
in the language of the test. 

Comment: The achievement, abilities, and 
traits of examinees who do not speak the lan¬ 
guage of the test as their primary language 
may be seriously mismeasured by the test.
The scores of test takers with severe linguistic
limitations will probably be meaningless. If 
language proficiency is not relevant to the 
purposes of testing, the test user should con¬ 
sider excusing these individuals, without pre¬ 
judice, from taking the test and substituting 
alternative evaluation methods. However, it 
is recognized that such actions may be 
impractical, unnecessary, or legally unaccept¬ 
able in some settings. 

Standard 11.23 

If a test is mandated for persons of a given 
age or all students in a particular grade, 
users should identify individuals whose dis¬ 
abilities or linguistic background indicates 
the need for special accommodations in test 
administration and ensure that these accom¬ 
modations are employed. 

Comment: Appropriate accommodations 
depend upon the nature of the test and the 
needs of the test taker. The mandating 
authority has primary responsibility for defin¬ 
ing the acceptable accommodations for vari¬ 
ous categories of test takers. The user must 
take responsibility for identifying those test 
takers who fall within these categories and 
implement the appropriate accommodations. 

Standard 11.24 

When a major purpose of testing is to 
describe the status of a local, regional, or 
particular examinee population, the program 
criteria for inclusion or exclusion of individuals
should be strictly adhered to.

Comment: In census-type programs, biased 
results can arise from the exclusion of particu¬ 
lar subgroups of students. Financial and other 
advantages may accrue either from exaggerat¬ 
ing or from reducing the proportion of high- 
achieving or low-achieving students. Clearly, 
these are unprofessional practices. 

Background

This chapter addresses issues important to 
professionals who use psychological tests with 
their clients. Topics include test selection and 
administration, test interpretation, collateral 
information used in psychological testing, types
of tests, and purposes of testing. The types of 
psychological tests reviewed in this chapter 
include cognitive and neuropsychological; 
adaptive, social, and problem behavior; family 
and couples; personality; and vocational. In 
addition, the chapter includes an overview of 
four common uses of psychological tests: 
diagnosis; intervention planning and outcome 
evaluation; legal and governmental decisions; 
and personal awareness, growth, and action. 

Employment testing is another context 
in which psychological testing is used. The 
standards in this chapter are applicable to those 
employment settings in which individual in- 
depth assessment is conducted (e.g., an evalu¬ 
ation of a candidate for a senior executive 
position). Employment settings in which tests 
are designed to measure specific job-related 
characteristics across multiple candidates are 
treated in the text and standards of chapter 14. 

For all professionals who use tests, knowledge
of cultural background and physical capabilities
that influence (a) a test taker's development,
(b) the methods for obtaining and conveying 
information, and (c) the planning and imple¬ 
mentation of interventions is critical. Therefore, 
readers are encouraged to review chapters 7, 
8, 9, and 10, which discuss fairness and bias in
testing, the rights and responsibilities of test 
takers, testing individuals of diverse linguistic 
backgrounds, and testing individuals with 
disabilities. Readers will find important addi¬ 
tional detail on validity; reliability; test devel¬ 
opment; scaling; test administration, scoring, 
and reporting; and general responsibilities 
of test users in chapters 1, 2, 3, 4, 5, and 11, 
respectively. 


The use of tests provides one method of 
collecting information within the larger frame¬ 
work of a psychological assessment of an indi¬ 
vidual. Typically, psychological assessments 
involve an interaction between a professional 
who is trained and experienced in testing and 
a client. Clients may include patients, counse- 
lees, parents, employees, employers, attorneys, 
students, and other responsible parties who 
are test takers or who use the test results con¬ 
tained in psychological reports. 

The results from tests and inventories, used 
within the context of a psychological assessment, 
may help the professional to understand the 
client more fully and to develop more informed 
and accurate hypotheses, inferences, and decisions
about a client's situation. A psychological
assessment is a comprehensive examination 
undertaken to answer specific questions about 
a client’s psychological functioning during a 
particular time interval or to predict a client's
psychological functioning in the future. An 
assessment may include administering and scor¬ 
ing tests, and interpreting test scores, all within 
the context of the individual's personal history.
Inasmuch as test scores characteristically are 
interpreted in the context of other information 
about the client, an individual psychological 
assessment usually also includes interviewing 
the client; observing client behavior; reviewing 
educational, psychological, and other relevant 
records; and integrating these findings with 
other information that may be provided by 
third parties. The tasks of a psychological 
assessment—collecting, evaluating, integrating, 
and reporting salient information relevant to 
those aspects of a client’s functioning that are 
under examination—comprise a complex and 
sophisticated set of professional activities. 

The interpretation of tests and inventories 
can be a valuable part of the intervention process
and, if used appropriately, can provide useful 
information to clients as well as to other users 
of the test interpretation. For example, the results
of tests and inventories may be used to assess the
psychological functioning of an individual; to 
assign diagnostic classifications; to detect neu¬ 
ropsychological impairment; to assess cognitive 
and personality strengths, vocational interests, 
and values; to determine developmental stages; 
and to evaluate treatment outcomes. Test results 
also may provide information used to make deci¬ 
sions that have a powerful and lasting impact on 
people’s lives (e.g., vocational and educational 
decision making; diagnosis; treatment planning; 
selection decisions; intervention and outcome 
evaluation; parole, sentencing, civil commit¬ 
ment, child custody, and competency to stand 
trial decisions; and personal injury litigation). 

Test Selection and Administration 

Prior to beginning the assessment process, 
the test taker should understand who will have 
access to the test results and the written report,
how test results will be shared with the test 
taker, and if and when decisions based on the 
test results will be shared with the test taker 
and/or a third party. The assessment process
begins by clarifying, as much as is possible, 
the reasons for which a client is presented for 
assessment. Guided by these reasons or other 
relevant concerns, the tests, inventories, and 
diagnostic procedures to be used are chosen, 
and other sources of information needed to 
evaluate the client and the referral issues are 
identified. The professional reviews more than 
the name of the test in choosing a test and is 
guided by the validity and reliability evidence 
and the applicability of the normative data 
available in the test’s accumulated research 
literature. In addition to being thoroughly 
versed in proper administrative procedure, the 
professional is responsible for being familiar 
with the validity and reliability evidence for 
the intended use and purposes of the tests and 
inventories selected and for being prepared to 
develop a logical analysis that supports the 
various facets of the assessment and the infer¬ 
ences made from the assessment. 


Validity and reliability considerations are 
paramount, but the demographic characteris¬ 
tics (e.g., gender, age, income, sociocultural 
and language background, education and other 
socioeconomic variables) of the group for which 
the test was originally constructed and for 
which initial and subsequent normative data 
are available also are important test selection 
issues. Selecting a test with demographically 
appropriate normative groups relevant for the 
client being tested is important to the gener- 
alizability of the inferences that the professional 
seeks to make. Sometimes the items or tasks 
contained in a test are designed for a particular 
group and are viewed as irrelevant for another 
group. A test constructed for one group may 
be applied to other groups with appropriate 
qualifications that explain the test choice 
based on the supporting research data and 
on professional experience. 

The selection of psychological tests and 
inventories, for a particular client, often is 
individualized. However, in some settings a 
predetermined battery of tests may be taken by 
all participants, and group interpretations may 
be provided. The test taker may be a child, an 
adolescent, or an adult. The settings in which 
the tests or inventories are used include (but 
are not limited to) preschool, elementary, mid¬ 
dle, or secondary schools; colleges or universi¬ 
ties; pre-employment or employment settings; 
mental health or outpatient clinics; hospitals; 
prisons; or professionals’ offices. 

Professionals who oversee testing and assess¬ 
ment are responsible for ensuring that all persons 
who administer and score tests have received 
the appropriate education and training needed 
to perform these tasks. In addition, they are 
responsible in group testing situations for ensuring
that the individuals who use the test results
are trained to interpret the scores properly. 

When conducting psychological testing, 
standardized test administration procedures 
should be followed. When nonstandard 
administration procedures are needed, they 
are to be described and justified. Professionals 
also are responsible for ensuring that testing
conditions are appropriate. For example, the 
examiner may need to determine if the client is 
capable of reading at the level required, and if 
clients with vision, hearing, or neurological dis¬ 
abilities are adequately accommodated. Finally, 
professionals are responsible for protecting the 
confidentiality and security of the test results 
and the testing materials. 

One advantage of individually adminis¬ 
tered measures is the opportunity to observe 
and adjust testing conditions as needed. In 
some circumstances, test administration may 
provide the opportunity for skilled examiners 
to carefully observe the performance of persons 
under standardized conditions. For example, 
their observations may allow them to more 
accurately record behaviors being assessed, to 
understand better the manner in which persons 
arrive at their answers, to identify personal 
strengths and weaknesses, and to make modi¬ 
fications in the testing process. Thus, the 
observations of trained professionals can be 
important to all aspects of test use. 

Test Score Interpretation 

Test scores ideally are interpreted in light 
of the available normative data, the psycho¬ 
metric properties of the test, the temporal sta¬ 
bility of the constructs being measured, and 
the effect of moderator variables and demo¬ 
graphic characteristics (e.g., gender, age, 
income, sexual orientation, sociocultural and 
language background, education, and other 
socioeconomic variables) on test results. The 
professional rarely has the resources available 
to personally conduct the research or to 
assemble representative norms needed to 
make accurate inferences about each individ¬ 
ual client’s current and future functioning. 
Therefore, the professional may rely on the 
research and the body of scientific knowledge 
available for the test that warrants appropriate 
inferences. Presentation and analyses of validity
and reliability evidence often are not needed
in a written report, but the professional
strives to understand, and prepares to articulate,
such evidence as the need arises.

Tests and inventories that meet high technical
standards of quality are a necessary but not
a sufficient condition to ensure the responsi¬ 
ble use and interpretation of test scores. The 
level of competence of the professional who 
interprets the scores and integrates the infer¬ 
ences derived from psychological tests depends 
upon the educational and experiential qualifi¬ 
cations of the professional. With experience, 
professionals learn that the challenges in psy¬ 
chological test score interpretation increase in 
magnitude along a continuum of professional 
judgment with brief screening inventories at 
one end of the continuum and comprehensive 
multidimensional assessments at the other. For 
example, the interpretations of achievement and 
ability test scores, personality test scores, and 
batteries of neuropsychological test scores represent
points on a continuum that require
increasing levels of specialized knowledge, 
judgment, and skill by an experienced profes¬ 
sional regardless of the soundness of the techni¬ 
cal characteristics of the tests being used. The 
education and experience necessary to adminis¬ 
ter group tests and/or proctor computer-admin¬ 
istered tests generally are less stringent than are 
the qualifications necessary to interpret individ¬ 
ually administered tests. The use and inter¬ 
pretation of individually administered tests 
requires completion of rigorous educational and 
applied training, a high degree of professional 
judgment, appropriate credentialing, and adher¬ 
ence to the professional’s ethical guidelines. 

When making inferences about a client’s 
past, present, and future behaviors and other 
characteristics from test scores, the professional 
reviews the literature to develop familiarity 
with supporting evidence. When there is strong 
evidence supporting the reliability and validity 
of a test, including its applicability to the client 
being assessed, the professional’s ability to draw 
inferences increases. Nevertheless, the profes¬ 
sional still corroborates results from testing with 
additional information from a variety of sources 
such as interviews and results from other tests.
When an inference is based on a single study 
or based on several studies whose samples are 
not representative of the client, the professional 
is more cautious about the inferences. Corroborating
data from the assessment's multiple sources
of information, including stylistic and test-taking
behaviors inferred from observations during
the test, will strengthen the confidence placed
in the inference. Importantly, data that are not 
supportive of the inference are acknowledged 
and either reconciled or noted as limits to the 
confidence placed in the inference. 

An interpretation of a test taker's test scores
based upon existing research examines not only 
the demonstrated relationship between the scores 
and the criterion or criteria, but also the appropriateness
of the latter. The criterion and the
chosen predictor test or tests are subjected to a 
similar examination to understand the degree to 
which their underlying constructs are congruent 
with the inferences under consideration. 

Threats to the interpretability of obtained 
scores are minimized by clearly defining how 
particular psychological tests are used. These 
threats occur as a result of construct-irrelevant 
variance (i.e., aspects of the test that are not
relevant to the purpose of the test scores) and
construct underrepresentation (i.e., important
facets relevant to the purpose of the testing, but
for which the test does not account). A client's
response bias is another example of a construct- 
irrelevant component that may significantly 
skew the obtained scores, possibly rendering 
the scores uninterpretable. In situations where 
response bias is anticipated, the professional 
may choose a test that has scales (e.g., faking 
good, faking bad, social desirability, percent yes, 
percent no) that clarify the threats to validity 
from the test taker’s response bias. In so doing, 
the professional may be able to assess the degree 
to which test takers are acquiescing to the per¬ 
ceived demands of the test administrator or 
attempting to portray themselves as impaired 
by “faking bad,” or well-functioning by “faking
good.” In interpreting the test taker's obtained
response bias score(s), the evidence of validity
for constructs underlying each response bias
scale, each scale's internal consistency, its
interrelations with other scales, and evidence 
of validity are considered. 

For some purposes, including career coun¬ 
seling and neuropsychological assessment, test 
batteries frequently are used. Such batteries often
include tests of verbal ability, numerical ability, 
nonverbal reasoning, mechanical reasoning, 
clerical speed and accuracy, spatial ability, and 
language usage. Some batteries also include 
interest and personality inventories. When psy¬ 
chological test batteries incorporate multiple 
methods and scores, patterns of test results fre¬ 
quently are interpreted to reflect a construct or 
even an interaction among constructs underlying
test performances. Higher order interactions
among the constructs underlying configurations 
of test outcomes may be postulated on the basis 
of test score patterns. The literature reporting 
evidence of reliability and validity that supports 
the proposed interpretations should be identi¬ 
fiable. If the literature is incomplete, the resulting 
inferences may be presented with the qualifica¬ 
tion that they are hypotheses for future verifi¬ 
cation rather than probabilistic statements that 
imply some known validity evidence. 

Collateral Information Used in Psychological 
Testing and Psychological Assessment 

The quality of psychological testing and 
psychological assessment is enhanced by 
obtaining credible collateral information from 
various third-party sources such as teachers, 
personal physicians, family members, and 
school or employment records. Psychological 
testing also is enhanced by using various methods 
to acquire information. Structured behavioral 
observations, checklists and ratings, interviews, 
and criterion- and norm-referenced measures 
are but a few of the methods that may be used 
to acquire information. The use of psychologi¬ 
cal tests also can be enhanced by acquiring 
information about multiple traits or attributes 
to help characterize a person. For example, an 
evaluation of career goals may be enhanced by
obtaining a history of current and prior employ¬ 
ment as well as by administering tests to assess 
academic aptitude and achievement, vocational 
interests, work values, and personality and tem¬ 
perament characteristics. The availability of infor¬ 
mation on multiple traits or attributes, when 
acquired from various sources and through the 
use of various methods, enables professionals to 
assess more accurately an individual's psychosocial
functioning and facilitates more effective
decision making. 

Types of Psychological Tests 

For purposes of this chapter, the types of psy¬ 
chological tests have been divided into five 
categories: cognitive and neuropsychological 
tests; adaptive, social, and problem behavior 
tests; family and couples tests; personality 
tests; and vocational tests. 

Cognitive and Neuropsychological Testing 

Tests often are used to assess various classes 
of cognitive and neuropsychological functioning
including intelligence; broad ability domains 
(e.g., verbal, quantitative, and spatial abilities); 
and more focused domains (e.g., attention, 
sensorimotor functions, perception, learning, 
memory, reasoning, executive functions, and 
language). Overlap may occur in the constructs 
that are assessed by tests of differing functions 
or domains. In common with other types of 
tests, cognitive and neuropsychological tests 
require a minimally sufficient level of test-taker 
attentional capacity. 

Cognitive Ability. Measures designed to 
quantify cognitive abilities are among the most
widely administered tests. The interpretation of 
cognitive ability tests is guided by the theoretical 
constructs used to develop the test. 

Many cognitive ability tests consist of mul¬ 
tidimensional test batteries that are designed 
to assess a broad range of abilities and skills. 
Individually administered test batteries also are 
required for testing for purposes such as diagnosing
a cognitive disorder. Test results are used
to draw inferences about a person's overall level
of intellectual functioning as well as strengths 
and weaknesses in various cognitive abilities. 
Because each test in a battery examines a dif¬ 
ferent function, ability, skill, or combination 
thereof, the test taker’s performance can be 
understood best when scores are not combined 
or aggregated, but rather when each score is 
interpreted within the context of all other 
scores and other assessment data. For example, 
low scores on timed tests alert the examiner to 
slowed responding as a problem that may not 
be apparent if scores on different kinds of tests 
are combined. 

Attention. Attention refers to that class 
of functioning that encompasses arousal, estab¬ 
lishment and deployment of sets, sustained 
attention, and vigilance as constructs. Tests 
may measure levels of alertness, orientation, 
and localization; the ability to focus, shift, and 
maintain attention and to track one or more 
stimuli under various conditions; span of 
attention; information processing speed and 
choice reaction time; and short-term informa¬ 
tion storage capacity. Scores for each aspect of 
attention that has been examined should be 
reported individually so that the nature of an 
attention disorder can be clarified. 

Motor, Sensorimotor Functions, and 
Lateral Preferences. Visual, auditory, somato¬ 
sensory and other sensory sensitivity and dis¬ 
crimination can be measured by simple motor 
or verbal responses to selective stimulation 
upon command. 

Perception and Perceptual Organiza¬ 
tion/Integration. This class of functioning 
involves reasoning and judgment as they relate 
to the processing and elaboration of complex 
sensory combinations and inputs. Tests of per¬ 
ception may emphasize immediate perceptual 
processing but also may require conceptualiza¬ 
tions that involve some reasoning and judg¬ 
mental processes. Some tests have a motor 
component ranging from a simple motor 
response to an elaborate construction. Also, 
some of these tests penalize the test taker for
slow performance that may be caused by some¬ 
thing other than perceptual dysfunction. 

Learning and Memory. This class of 
functions involves the acquisition and retention 
of information beyond the attentional require¬ 
ments of immediate or short-term information 
processing and storage. These tests may measure 
acquisition of new information through various 
sensory channels and by means of assorted test 
formats (e.g., word lists, prose passages, geomet¬ 
ric figures, formboards, digits, and musical 
melodies). Memory tests also may require 
retention and recall of old information (e.g., 
personal data as well as commonly learned 
facts and skills). 

Abstract Reasoning and Categorical 
Thinking. Tests of reasoning and thinking 
vary widely. They assess the examinee's ability
to infer relationships or to respond to changing
environmental circumstances and to act in 
goal-oriented situations. 

Executive Functions. This class of func¬ 
tions is involved in the organized performances 
that are necessary for the independent, purpo¬ 
sive and effective attainment of personal goals 
in various cognitive processing, problem-solv¬ 
ing and social situations. Some tests emphasize 
reasoned plans of action that anticipate consequences
of alternative solutions, motor performance
in problem-solving situations that require
goal-oriented intentions, and regulation of per¬ 
formance for achieving a desired outcome. 

Language. Language assessment typically 
focuses on phonology, morphology, syntax, 
semantics, and pragmatics. Receptive and
expressive language functions may be assessed, 
including listening, reading, talking, and writ¬ 
ten language skills and abilities. Assessment of 
central language disorders focuses on function¬ 
al speech and verbal comprehension measured 
through oral, written, or gestural modes; lexi¬ 
cal access and elaboration; repetition of spoken 
language; and associative verbal fluency. 

When assessing persons who are non¬ 
native English speakers or who are bilingual or 
multilingual, language assessment often includes
an assessment of language competence and the 
order of dominance among the different lan¬ 
guages. If a multilingual person is assessed for 
a possible language disorder, one issue for the 
professional to consider is the degree to which
the disorder may be due more directly to lan¬ 
guage-related qualities (e.g., phonological, 
morphological, syntactic, semantic, pragmatic 
delays; mental retardation; peripheral sensory 
or central neurological impairment; psycholog¬ 
ical conditions; hearing disorders) than to 
dominance of a non-English language. 

Academic Achievement. Academic 
achievement tests are measures of academic 
knowledge and skills that a person has acquired 
in formal and informal learning opportunities. 
Two major types of academic achievement 
tests include general achievement batteries and 
diagnostic achievement tests. General achieve¬ 
ment batteries are designed to assess a person’s 
level of learning in multiple areas (e.g., reading, 
mathematics, spelling, social studies, science). 
Diagnostic achievement tests, on the other 
hand, typically focus on one particular subject 
area (e.g., reading) and assess important aca¬ 
demic skills in greater detail. Test results are 
used to determine the test taker's strengths as 
well as specific difficulties and may help identi¬ 
fy sources of the difficulties and ways to over¬ 
come them. Chapter 13 provides additional 
detail on academic achievement testing in 
educational settings.

Social, Adaptive, and Problem Behavior Testing 

Measures of social, adaptive, and problem 
behaviors assess ability and motivation to care 
for one’s self and to relate to others. Adaptive 
behaviors include a repertoire of knowledge, 
skills, and abilities that enable a person to meet 
the daily demands and expectations of the 
environment, such as eating, dressing, using 
transportation, interacting with peers, com¬ 
municating with others, making purchases, 
managing money, maintaining a schedule, 
remaining in school, and maintaining a job. 
Problem behaviors include behavioral adjustment
difficulties that interfere with a person's
effective functioning in daily life situations. 

Family and Couples Testing 

Family testing addresses the issues of family 
dynamics, cohesion, and interpersonal relations 
among family members including partners, parents,
children, and extended family members.
Tests developed to assess families and couples 
are distinguished by measuring the interaction 
patterns of partial or whole families, requiring 
simultaneous focus on two or more family 
members in terms of their transactions. Testing 
with couples may address personal factors such 
as issues of intimacy, compatibility, shared 
interests, trust, and spiritual beliefs. 

Personality Testing 

Broadly considered, the assessment of per¬ 
sonality requires a synthesis of aspects of an 
individual's functioning that contribute to the
formulation and expression of thoughts, atti¬ 
tudes, emotions, and behaviors. In the assess¬ 
ment of an individual, cognitive and emotional 
functioning may be considered separately, but 
their influences are interrelated. For example, a 
person whose perceptions are highly accurate, 
or who is relatively stable emotionally, may be 
able to control suspiciousness better than can a 
person whose perceptions are inaccurate or dis¬ 
torted or who is emotionally unstable. 

Scores on a personality test may be regard¬ 
ed as reflecting the underlying theoretical con¬ 
structs or empirically derived scales or factors 
that guided the test’s construction. The stimu¬ 
lus and response formats of personality tests 
vary widely. Some include a series of questions 
(e.g., self-report inventories) for which the test
taker is required to choose from several well-
defined options; others involve being placed in a 
novel situation in which the test taker's response
is not completely structured (e.g., responding to 
visual stimuli, telling stories, discussing pictures, 
or responding to other projective stimuli). The 
responses are scored and combined into either
logically or statistically derived dimensions
established by previous research.

Personality tests may be designed to focus 
on the assessment of normal or abnormal atti¬ 
tudes, feelings, traits, and related characteristics. 
Tests intended to measure normal personality 
characteristics are constructed to yield scores 
reflecting the degree to which a person mani¬ 
fests personality dimensions empirically iden¬ 
tified and hypothesized to be present in the 
behavior of most individuals. A person’s config¬ 
uration of scores on these dimensions is then 
used to infer how the person behaves presently
and how she/he may behave in new situations. 
Test scores outside of the expected range may 
be considered extreme expressions of normal 
traits or indicative of psychopathology. Such 
scores also may reflect normal functioning of 
the person within a culture different from that 
of the normative population sample. 

Other personality tests are designed specifically
to measure constructs underlying abnormal
functioning and psychopathology. Developers 
of some of these tests use previously diagnosed 
individuals to construct their scales and base 
their inferences on the association between the 
test’s scale scores, within a given range, and the 
behavioral correlates of persons who scored 
within that range. If inferences made from 
scores go beyond the theory that guided the 
test’s construction, then the inferences must be 
validated by collecting and analyzing additional 
relevant data. 

Vocational Testing 

Vocational testing generally includes the 
measurement of interests, work needs, and 
values, as well as consideration and assessment 
of related elements of career development, 
maturity, and indecision. The results from 
inventories that assess these constructs often 
are used for enhancing personal growth and 
understanding, career counseling, outplace¬ 
ment counseling, and vocational decision 
making. These interventions frequently take 
place in the context of educational settings. 
However, interest inventories and measures of
work values also may be used in workplace set¬ 
tings as part of training and development pro¬ 
grams, for career planning, or for selection, 
placement, and advancement decisions. 

Interest Inventories. The measurement of 
interests is designed to identify a person’s pref¬ 
erences for various activities. Self-report interest 
inventories are widely used to assess personal 
preferences including likes and dislikes for vari¬ 
ous work and leisure activities, school subjects, 
occupations, or types of people. The resulting 
scores may provide insight into types and pat¬ 
terns of differential interests in educational cur¬ 
ricula (e.g., college majors), in different fields 
of work (e.g., specific occupations), or in more 
general or basic areas of interests related to spe¬ 
cific activities (e.g., sales, office practices, or 
mechanical activities). 

Work Values Inventories. The measurement
of work values identifies a person's preferences
for the various reinforcements one may
obtain from work activities. Sometimes these 
values are identified as needs that persons seek 
to satisfy. Work values or needs may be catego¬ 
rized as intrinsic and important for the pleasure 
gained from the activity (e.g., independence, 
ability utilization, achievement) or as extrinsic
and important for the rewards they bring (e.g., 
coworkers, supervisory relations, working 
conditions). The format of work values tests 
usually involves a self-rating of the impor¬ 
tance of the value associated with qualities 
described by the items.

Measures of Career Development, 
Maturity, and Indecision. Additional areas of 
vocational assessment include measures of 
career development and maturity and measures 
of career indecision. Inventories that measure 
career development and maturity typically elic¬ 
it client self-descriptions in response to items 
that inquire about the individual’s knowledge 
of the world of work; self-appraisal of one’s 
decision-making skills; attitudes toward careers 
and career choices; and the degree to which 
the individual already has engaged in career 
planning. Measures of career indecision usually
are constructed and standardized to assess
both the level of career indecision of a client 
as well as the reasons for, or antecedents of, 
indecision. Such career development, maturi¬ 
ty, and indecision findings may be used with 
individuals and groups to guide the design 
and delivery of career services and to evaluate 
the effectiveness of career interventions. 

Purposes of Psychological Testing 

For purposes of this chapter, psychological test 
uses have been divided into four categories:
testing for diagnosis; intervention planning and 
outcome evaluation; legal and governmental 
decisions; and personal awareness, growth and 
action. However, these categories are not always 
mutually exclusive. 

Testing for Diagnosis 

Diagnosis refers to a process that includes 
the collection and integration of test results 
with prior and current information about a 
person together with relevant contextual con¬ 
ditions to identify characteristics of healthy 
psychological functioning as well as psycholog¬ 
ical disorders. Disorders may manifest them¬ 
selves in information obtained during the 
testing of an individual's cognitive, emotional,
social, personality, neuropsychological, physi¬ 
cal, perceptual, and motor attributes. 

Psychodiagnosis. Psychological tests are 
helpful to professionals involved in the psycho¬ 
logical diagnosis of an individual. Testing may 
be performed to confirm a hypothesized diagno¬ 
sis or to rule out alternative diagnoses. Psycho¬ 
diagnosis is complicated by the prevalence of 
comorbidity between diagnostic categories. For 
example, a client diagnosed as suffering from 
schizophrenia simultaneously may be diagnosed 
as suffering from depression. Or, a child diag¬ 
nosed as having a learning disability also may 
be diagnosed as suffering from an attention 
deficit disorder. The goal of psychodiagnosis is 
to assist each client in receiving the appropriate 
interventions for the psychological or behavioral
dysfunctions that the client, or a third party,
views as impairing the client's expected functioning
and/or enjoyment of life. In developing
treatment plans, professionals often use noncategorical
diagnostic descriptions of client
functioning along treatment-relevant dimen¬ 
sions (e.g., degree of anxiety, amount of suspi¬ 
ciousness, openness to interpretations, amount 
of insight into behaviors, and level of intellec¬ 
tual functioning). 

The first step in evaluating a test’s suit¬ 
ability to yield scores or information indicative 
of a particular diagnostic syndrome is to com¬ 
pare the construct that the test is intended to 
measure with the symptomatology described in 
the diagnostic criteria. This step is important 
because different diagnostic systems may use 
the same diagnostic term to describe different 
symptoms; even within one diagnostic system 
the symptoms described by the same term may 
differ between editions of the manual identify¬ 
ing the diagnostic criteria. Similarly, a test that 
uses a diagnostic term in its title may differ sig¬ 
nificantly from another test using a similar title 
or from a subscale with the same term. For
example, some diagnostic systems may define 
depression by behavioral symptomatology 
(e.g., psychomotor retardation, disturbance in 
appetite or sleep) or by affective symptomatol- 
ogy (e.g., dysphoric feeling, emotional flatness) 
or by cognitive symptomatology (e.g., thoughts 
of hopelessness, morbidity) or some other 
symptomatology. Further, rarely are the symp¬ 
toms of diagnostic categories mutually exclu¬ 
sive. Hence, it can be expected that a given 
symptom may be shared by several diagnostic 
categories. More knowledgeable and precisely 
drawn inferences relating to a diagnosis may be 
obtained from test scores if appropriate weight 
is given to the symptoms included in the diag¬ 
nostic category and to the suitability of each 
test to assess the symptoms. 

Different methods may be used to assess 
particular diagnostic categories. Some methods 
rely primarily on structured interviews using a 
“yes” or “no” format in which the professional
is interested in the presence or absence of
diagnosis-specific symptomatology. Other methods
often rely principally on tests of personality or 
cognitive functioning and use configurations of 
obtained scores. These configurations of scores 
indicate the degree to which a client's responses
are similar to those of individuals who have
been determined by prior research to belong to 
a specific diagnostic group. 

Diagnoses made with the help of test scores 
typically are based on empirically demonstrat¬ 
ed relationships between the test score and the 
diagnostic category. Validity studies that demon¬ 
strate relationships between test scores and 
diagnostic categories currently are available for
some diagnostic categories. Sometimes tests that 
do not have supporting validity studies also may 
be useful to the professional in arriving at a 
diagnosis. This also may occur, for example, 
when the symptoms assessed by a test are a 
subset of the criteria that comprise a particular 
diagnostic category. While it often is not feasi¬ 
ble for individual professionals to personally 
conduct research into relationships between 
obtained scores and inferences, their familiarity 
with the body of the research literature that 
examines these relationships is important. 

The professional often can enhance the 
diagnostic inferences derived from test scores 
by integrating the test results with inferences 
made from other sources of information regard¬ 
ing the client’s functioning such as self-reported 
history or information provided by significant 
others or systematic observations in the natural 
environment or in the testing setting. In arriv¬ 
ing at a diagnosis, a professional also looks for 
information that does not corroborate the 
diagnosis, and in those instances, places appro¬ 
priate limits on the degree of confidence placed 
in the diagnosis. When relevant to the referral 
issue, the professional acknowledges alternative 
diagnoses that may require consideration. 
Particular attention is paid to all relevant avail¬ 
able data before concluding that a client falls 
into a diagnostic category. Cultural sensitivity 
is paramount to avoid misdiagnosing and overpathologizing
culturally appropriate behavior,
affect or cognition. Tests also are used to assess 
the appropriateness of continuing the initial 
diagnostic characterization, especially after a 
course of treatment or if the client’s psycholog¬ 
ical functioning has changed over time. 

Neuropsychodiagnosis. Neuropsycho¬ 
logical testing analyzes the current psychological 
and behavioral status, including manifestations 
of neurological, neuropathological, and neuro¬ 
chemical changes that may arise during devel¬ 
opment or from brain injury or illness. The 
purposes of neuropsychological testing typically 
include, but are not limited to, the following: 
differential diagnoses between psychogenic and 
neurogenic sources of cognitive, perceptual, and 
personality dysfunction; differential diagnoses 
between two or more suspected etiologies of 
cerebral dysfunction; evaluation of impaired 
functioning secondary to a cerebral, cortical, or 
subcortical event; establishment of neuropsy¬ 
chological baseline measurements for monitoring 
progressive cerebral disease or recovery effects; 
comparison of pre- and post-pharmacologic, 
surgical, behavioral, or psychological interven¬ 
tions; identification of patterns of higher cortical 
function and dysfunction for the formulation
of rehabilitation strategies and for the design of 
remedial procedures; and characterizing brain- 
behavior functions to assist the trier of fact in 
criminal and civil legal actions. 

Testing for Intervention Planning and Outcome
Evaluation 

Professionals often rely on test results for 
assistance in planning, executing, and evaluat¬ 
ing interventions. Therefore, their awareness of 
validity information that supports or does not 
support the relationship between test results, 
prescribed interventions, and desired outcome 
is important. Interventions may be intended to 
prevent the onset of one or more symptoms, to 
stabilize or overcome them, to ameliorate their 
effects, to minimize their impact, and to provide
for a person's basic physical, psychological,
and social needs. Intervention planning typically
occurs following an evaluation of the nature
and severity of a disorder and a review of person¬ 
al and contextual conditions that may impact its 
resolution. Subsequent evaluations may occur 
in an effort to diagnose further the nature and 
severity of the disorder, to review the effects of 
interventions, to revise them as needed, and to 
meet ethical and legal standards. 

Testing for Judicial and Governmental Decisions 

Clients may voluntarily seek psychological 
testing as part of psychological assessments to 
assist in matters before a court or other governmental
agencies. Conversely, courts or other
governmental agencies sometimes require a 
client to submit involuntarily to a psychological 
or neuropsychological assessment that may 
involve a wide range of psychological tests. The 
goal of these psychological assessments is to 
provide important information to a third party, 
client's attorney, opposing attorney, judge, or
administrative board about the psychological 
functioning of the client that has bearing on 
the legal issues in question. At the outset of 
evaluations for judicial and government deci¬ 
sions, it is imperative to clarify the purpose of 
the evaluation, who will have access to the test 
results and the reports, and any rights that 
the client may have to refuse to participate in 
court-ordered evaluations. 

The goals of psychological testing in judi¬ 
cial and governmental settings are informed and 
constrained by the legal issues to be addressed, 
and a detailed understanding of their salient 
aspects is essential. Legal issues may arise as 
part of a civil proceeding (e.g., involuntary 
commitment, testamentary capacity, compe¬ 
tence to stand trial, parole, child custody, per¬ 
sonal injury, discrimination issues), a criminal 
proceeding (e.g., competence to stand trial, not 
guilty by reason of insanity, mitigating circumstances
in sentencing), determination of reasonable
accommodations for employees with
disabilities, or an administrative proceeding or 
decision (e.g., license revocation, parole, worker’s
compensation). Each of these legal issues is
defined in law applicable to a particular legislative
jurisdiction. The definition of each legal
issue may be jurisdiction specific. For example, 
the criteria by which a person can be involun¬ 
tarily committed often differ between legisla¬ 
tive jurisdictions. Furthermore, tests initially 
administered for one purpose also may be used 
for another purpose (e.g., initially used for a 
civil case but later used in administrative or 
criminal proceedings). 

Legislatures, courts, and other administrative
bodies often define legal issues in commonly
used language, not in diagnostic or other
technical psychological terms. The professional 
is responsible for explaining the diagnostic frame 
of reference, including test scores and inferences 
made from them, in terms of the legal criteria 
by which the jury, judge, or administrative board 
will decide the legal issue. For example, a diag¬ 
nosis of schizophrenia or neuropsychological 
impairment, which does not also include a ref¬ 
erence to the legal criteria, neither precludes an 
examinee from obtaining sole custody of children 
in a child custody dispute nor does it necessar¬ 
ily acquit a person of criminal responsibility. 

In instances involving legal or quasi-legal
issues, it is important to assess the examinee's
test-taking orientation including response bias 
to ensure that the legal proceedings have not 
affected the responses given. For example, a 
person seeking to obtain the greatest possible 
monetary award for a personal injury may be 
motivated to exaggerate cognitive and emotional 
symptoms, while persons attempting to forestall 
the loss of a professional license may attempt to 
portray themselves in the best possible light by
minimizing symptoms or deficits. In forming
an assessment opinion, it is necessary to inter¬ 
pret the test scores with informed knowledge 
relating to the available validity and reliability 
evidence. When forming such opinions, it also 
is necessary to integrate a client's test scores with
all other sources of information that bear on 
current status including psychological, medical, 
educational, occupational, legal, and other rel¬ 
evant collateral records. 


Some tests are intended to provide information
about a client's functioning that helps clarify
a given legal issue (e.g., parental functioning in 
a child custody case or ability to understand 
charges against a defendant in competency to 
stand trial matters). The manuals of some tests 
also provide demographic and actuarial data 
for normative groups that are representative of 
persons involved in the legal system. However, 
many tests measure constructs that are generally 
relevant to the legal issues even though norms 
specific to the judicial or governmental context 
may not be available. Professionals are expected 
to make every effort to be aware of evidence of 
validity and reliability that supports or does not 
support their inferences and to place appropri¬ 
ate limits on the opinions rendered. Test users 
who practice in judicial and government set¬ 
tings are expected to be aware of conflicts of 
interest that may lead to bias in the interpreta¬ 
tion of test results. 

Protecting the confidentiality of a client's
test results and of the test instrument itself poses 
particular challenges for professionals involved 
with attorneys, judges, jurors, and other legal 
and quasi-legal decision makers. The test taker
does have a right to expect that test results will
be communicated only to persons who are 
legally authorized to receive them and that 
other information from the testing session that 
is not relevant to the evaluation will not be 
reported. It is important for the professional to 
be apprised of possible threats to confidentiality 
and test security (e.g., releasing the test questions, 
the examinee's responses, and raw and scaled
scores on tests to another qualified profession¬ 
al) and to seek, if necessary, appropriate legal 
and professional remedies. 

Testing for Personal Awareness, Growth, 
and Action 

Tests and inventories frequently are used 
to provide information to help individuals to 
understand themselves, to identify their own 
strengths and weaknesses, and to otherwise 
clarify issues important to their own decision
making and development. For example, test
results from personality inventories may help
clients better understand themselves and also
understand their interactions with others.
Results from interest inventories and tests of 
ability may be useful to individuals who are 
making educational and career decisions. 
Appropriate cognitive and neuropsychological 
tests that have been normed and standardized 
for children may facilitate the monitoring of 
development and growth during the formative 
years when relevant interventions may be more 
efficacious for preventing potentially disabling 
learning disabilities from being overlooked or 
misdiagnosed. 

Test results may be used for self-exploration, 
self-growth, and decision making in several 
ways. First, the results can provide individuals 
with new information that allows them to 
compare themselves with others or to evaluate 
themselves by focusing on self-descriptions and 
characterizations. Test results also may serve to 
stimulate discussions between a client and pro¬ 
fessional, to facilitate client insights, to provide 
directions for future considerations, to help 
individuals identify strengths and assets, and to 
provide the professional with a general frame¬ 
work for organizing and integrating informa¬ 
tion about an individual. Testing for personal 
growth may take place in training and develop¬ 
ment programs, within an educational curricu¬ 
lum, during psychotherapy, in rehabilitation 
programs as part of an educational or career 
planning process, or in other situations. 

Summary 

The application of psychological tests continues
to expand in scope and depth on a course that 
is characterized by an increasingly diverse set of 
purposes, procedures, and assessment needs and 
challenges. Therefore, the responsible use of 
tests in practice requires a commitment by the 
professional to develop and maintain the nec¬ 
essary knowledge and competence to select, 
administer, and interpret tests and inventories
as crucial elements of the psychological testing
and assessment process. The standards in this 
chapter provide a framework for guiding the 
professional toward achieving relevance and 
effectiveness in the use of psychological tests 
within the boundaries or limits defined by the 
professional's educational, experiential and ethical
foundations. Earlier chapters and standards
that are relevant to psychological testing and 
assessment describe general aspects of test quali¬ 
ty (chapters 1-6, chapter 11), test fairness 
(chapters 7-10), and test use (chapter 11). 
Chapter 13 discusses educational applications; 
chapter 14 discusses test use in the workplace, 
including credentialing, and the importance of
collecting data that provide evidence of a test’s 
accuracy for predicting job performance; and 
chapter 15 discusses test use in program evaluation
and public policy.

STANDARDS


Standard 12.1 

Those who use psychological tests should 
confine their testing and related assess¬ 
ment activities to their areas of compe¬ 
tence, as demonstrated through education, 
supervised training, experience, and appro¬ 
priate credentialing. 

Comment: The responsible use and interpreta¬ 
tion of test scores require appropriate levels of 
experience and sound professional judgment. 
Competency also requires sufficient familiarity 
with the population from which the test taker 
comes to allow appropriate interaction, test 
selection, test administration, and test inter¬ 
pretation. For example, when personality tests 
and neuropsychological tests are administered 
as part of a psychological assessment of an 
individual, the test scores must be understood 
in the context of the individual’s physical and 
emotional state, as well as the individual’s cul¬ 
tural, educational, occupational, and medical 
background, and must take into account other 
evidence relevant to the tests used. Test inter¬ 
pretation in this context requires professional¬ 
ly responsible judgment that is exercised 
within the boundaries of knowledge and 
skill afforded by the professional’s education, 
training, and supervised experience. 

Standard 12.2 

Those who select tests and interpret test 
results should refrain from introducing bias¬ 
es that accommodate individuals or groups 
with a vested interest in decisions affected 
by the test interpretation. 

Comment: Individuals or groups with a vested 
interest in the significance or meaning of the 
findings from psychological testing include 
many school personnel, attorneys, referring 
health professionals, employers, professional 
associates, and managed care organizations. In 
some settings a professional may have a professional
relationship with multiple clients (e.g.,
with both the test taker and the organization
requesting assessment). A professional engaged 
in a professional relationship with multiple 
clients takes care to ensure that the multiple 
relationships do not become a conflict of inter¬ 
est that would occur when the professional’s 
judgment toward one client is unduly influ¬ 
enced by his or her relationship with the other 
client. Test selections and interpretations that 
favor a special external expectation or perspec¬ 
tive by deviating from established principles of 
sound test interpretation are unprofessional 
and unethical. 

Standard 12.3 

Tests selected for use in individual testing 
should be suitable for the characteristics and 
background of the test taker. 

Comment: Considerations for test selection 
should include culture, language and/or physi¬ 
cal requirements of the test and the availability 
of norms and evidence of validity for a population
representative of the test taker. If no normative
or validity studies are available for the
population at issue, test interpretations should 
be qualified and presented as hypotheses rather 
than conclusions. 

Standard 12.4 

If a publisher suggests that tests are to be used 
in combination with one another, the profes¬ 
sional should review the evidence on which the 
procedures for combining tests is based and 
determine the rationale for the specific combi¬ 
nation of tests and the justification of the 
interpretation based on the combined scores. 

Comment: For example, if measures of developed 
abilities (e.g., achievement or specific or general 
abilities) or personality are packaged with inter¬ 
est measures to suggest a requisite combination 
of scores, or a neuropsychological battery is 
being applied, then supporting validity data for 
such combinations of scores should be available. 




Standard 12.5 

The selection of a combination of tests to 
address a complex diagnosis should be 
appropriate for the purposes of the assessment 
as determined by available evidence of validity. 
The professional's educational training and
supervised experience also should be com¬ 
mensurate with the test user qualifications 
required to administer and interpret the selected tests.

Comment: For example, in a neuropsychologi¬ 
cal assessment for evidence of an injury to a 
particular area of the brain, it is necessary to 
select a combination of tests of known diag¬ 
nostic sensitivity and specificity to impair¬ 
ments arising from trauma to various regions 
of the cerebral hemispheres. 
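
The sensitivity and specificity mentioned in this comment can be illustrated with a minimal sketch. It assumes a validation sample in which each case has a test result and an independently established diagnosis; the function name and the sample data are hypothetical and do not come from any particular test manual.

```python
def sensitivity_specificity(test_positive, has_condition):
    """Return (sensitivity, specificity) from paired boolean lists.

    test_positive[i] is True when the test flags case i;
    has_condition[i] is True when case i truly has the condition.
    """
    pairs = list(zip(test_positive, has_condition))
    true_pos = sum(1 for t, c in pairs if t and c)
    false_neg = sum(1 for t, c in pairs if not t and c)
    true_neg = sum(1 for t, c in pairs if not t and not c)
    false_pos = sum(1 for t, c in pairs if t and not c)
    if true_pos + false_neg == 0 or true_neg + false_pos == 0:
        raise ValueError("need cases both with and without the condition")
    # Sensitivity: share of affected cases the test detects.
    sensitivity = true_pos / (true_pos + false_neg)
    # Specificity: share of unaffected cases the test clears.
    specificity = true_neg / (true_neg + false_pos)
    return sensitivity, specificity


# Hypothetical validation sample of six cases.
flags = [True, True, False, True, False, False]
truth = [True, True, True, False, False, False]
sens, spec = sensitivity_specificity(flags, truth)
```

Estimates of this kind are only as trustworthy as the validation sample behind them, so they supplement rather than replace the published evidence of validity.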

Standard 12.6 

When differential diagnosis is needed, the
professional should choose, if possible, a test 
for which there is evidence of the test’s ability 
to distinguish between the two or more diag¬ 
nostic groups of concern rather than merely 
to distinguish abnormal cases from the gen¬ 
eral population. 

Comment: Professionals will find it particularly 
helpful if evidence of validity is in a form that
enables them to determine how much confi¬ 
dence can be placed in inferences regarding an 
individual. Differences between group means 
and their statistical significance provide inade¬ 
quate information regarding validity for 
individual diagnostic purposes. Additional 
information might consist of confidence inter¬ 
vals, effect sizes, or a table showing the degree 
of overlap of predictor distributions among 
different criterion groups. 
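
As a rough illustration of the additional information named in this comment, the following sketch computes a standardized effect size (Cohen's d with a pooled standard deviation) and the overlap between two criterion groups' score distributions. It assumes the scores in each group are approximately normal; the function names and the score lists are hypothetical.

```python
import math
from statistics import NormalDist, mean, stdev


def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * stdev(group_a) ** 2 + (n_b - 1) * stdev(group_b) ** 2
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / math.sqrt(pooled_var)


def normal_overlap(group_a, group_b):
    """Overlap (0 to 1) of normal curves fitted to each group's scores."""
    dist_a = NormalDist(mean(group_a), stdev(group_a))
    dist_b = NormalDist(mean(group_b), stdev(group_b))
    return dist_a.overlap(dist_b)


# Hypothetical predictor scores for two diagnostic groups.
group_one = [10, 12, 14, 16, 18]
group_two = [12, 14, 16, 18, 20]
effect = cohens_d(group_one, group_two)
shared = normal_overlap(group_one, group_two)
```

A large overlap signals that many individuals in the two groups obtain similar scores, which limits the confidence that can be placed in an individual diagnosis even when the difference between group means is statistically significant.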

Standard 12.7 

When the validity of a diagnosis is appraised 
by evaluating the level of agreement between 
test-based inferences and the diagnosis, the
diagnostic terms or categories employed
should be carefully defined or identified. 

Standard 12.8 

Professionals should ensure that persons 
under their supervision, who administer and 
score tests, are adequately trained in the settings
in which the testing occurs and with
the populations served. 

Standard 12.9 

Professionals responsible for supervising 
group testing programs should ensure that 
the individuals who interpret the test scores 
are properly instructed in the appropriate 
methods for interpreting them. 

Comment: If, for example, interest inventories 
are given to college students for use in aca¬ 
demic advising, the professional who super¬ 
vises the academic advisors is responsible for 
ensuring that the advisors know how to pro¬ 
vide an examinee an appropriate interpretation 
of the test results. 

Standard 12.10 

Prior to testing, professionals and test 
administrators should provide the test taker 
with appropriate introductory information 
in language understandable to the test taker. 
The test taker who inquires also should be 
advised of opportunities and circumstances, 
if any, for retesting. 

Comment: The client should understand test¬ 
ing time limits, who will have access to the 
test results, if and when test results will be 
shared with the test taker, and if and when 
decisions based on the test results will be 
shared with the test taker. 

Standard 12.11 

Professionals and others who have access to 
test materials and test results should ensure
the confidentiality of the test results and
testing materials consistent with legal and 
professional ethics requirements. 

Comment: Professionals should be knowledgeable
about and conform to record-keeping and confidentiality
guidelines required by the state or
province in which they practice and the pro¬ 
fessional organizations to which they belong. 
Confidentiality has different meanings for the 
test developer, the test user, the test taker, and 
third parties (e.g., school, court, employer). 
To the extent possible, the professional who 
uses tests is responsible for managing the confidentiality
of test information across all parties.
It is important for the professional to be
aware of possible threats to confidentiality and 
the legal and professional remedies available. 
Professionals also are responsible for main¬ 
taining the security of testing materials and 
for protecting the copyrights of all tests to the 
extent permitted by law. 

Standard 12.12 

The professional examines available norms 
and follows administration instructions, 
including calibration of technical equip¬ 
ment, verification of scoring accuracy and 
replicability, and provision of settings for 
testing that facilitate optimal performance 
of test takers. However, in those instances 
where realistic rather than optimal test set¬ 
tings will best satisfy the assessment purpose, 
the professional should report the reason for 
using such a setting and, when possible, also 
conduct the testing under optimal conditions 
to provide a comparison. 

Comment: Because the normative data against 
which a client's performance will be evaluated
were collected under the reported standard 
procedures, the professional needs to be aware 
of and take into account the effect that non¬ 
standard procedures may have on the client’s 
obtained score. When the professional uses
tests that employ an unstructured response
format, such as some projective techniques 
and informal behavioral ratings, the profes¬ 
sional should follow objective scoring criteria, 
where available and appropriate, that are clear 
and minimize the need for the scorer to rely 
only on individual judgment. The testing may 
be conducted in a realistic, less than optimal, 
setting to determine how a client with an 
attentional disorder, for example, performs in a 
noisy or distracting environment rather than 
in an optimal environment that typically 
protects the test taker from such external 
threats to performance efficiency.

Standard 12.13 

Those who select tests and draw inferences 
from test scores should be familiar with the 
relevant evidence of validity and reliability 
for tests and inventories used and should be 
prepared to articulate a logical analysis that 
supports all facets of the assessment and the 
inferences made from the assessment. 

Comment: A presentation and analysis of 
validity and reliability evidence generally is 
not needed in a written report, because it is 
too cumbersome and of little interest to most 
report readers. However, in situations in which 
the selection of tests may be problematic (e.g., 
verbal subtests with deaf clients), a brief 
description of the rationale for using or not 
using particular measures is advisable. 

When potential inferences derived from 
psychological test data are not supported by 
evidence of validity yet may hold promise for 
future validation, they may be described by 
the test developer and professional as hypothe¬ 
ses for further validation in test interpretation. 
Such interpretive remarks should be qualified 
to communicate to the source of the referral
that such inferences do not as yet have ade¬ 
quately demonstrated evidence of validity and 
should not be the basis for a diagnostic decision
or prognostic formulation.

Standard 12.14

The interpretation of test results in the 
assessment process should be informed 
when possible by an analysis of stylistic and 
other qualitative features of test-taking 
behavior that are inferred from observations 
during interviews and testing and from 
historical information. 

Comment: Such features of test-taking behavior 
include manifestations of fatigue, momentary 
fluctuations in emotional state, rapport with 
the examiner, test taker's level of motivation,
withholding or distortion of response as seen 
in instances of deception and malingering or 
in instances of pseudoneurological conditions, 
and unusual response or general adaptation to 
the testing environment. 

Standard 12.15 

Those who use computer-generated inter¬ 
pretations of test data should evaluate the 
quality of the interpretations and, when 
possible, the relevance and appropriateness 
of the norms upon which the interpretations 
are based. 

Comment: Efforts to reduce a complex set of 
data into computer-generated interpretations 
of a given construct may yield grossly mis¬ 
leading or simplified analyses of meanings of 
test scores, which in turn may lead to faulty
diagnostic and prognostic decisions as well 
as mislead the trier of fact in judicial and 
government settings. 

Standard 12.16 

Test interpretations should not imply that 
empirical evidence exists for a relationship 
among particular test results, prescribed 
interventions, and desired outcomes, unless 
empirical evidence is available for popula¬ 
tions similar to those representative of the 
examinee. 


Standard 12.17 

Criterion-related evidence of validity should 
be available when recommendations or deci¬ 
sions are presented by the professional as 
having an actuarial basis. 

Standard 12.18 

The interpretation of test or test battery 
results generally should be based upon mul¬ 
tiple sources of convergent test and collateral 
data and an understanding of the normative, 
empirical, and theoretical foundations as 
well as the limitations of such tests. 

Comment: A given pattern of test perform¬ 
ances represents a cross-sectional view of the 
individual being assessed within a particular 
context (i.e., medical, psychosocial, educa¬ 
tional, vocational, cultural, ethnic, gender, 
familial, genetic, and behavioral). The inter¬ 
pretation of findings derived from a complex 
battery of tests in such contexts requires 
appropriate education, supervised experience, 
and an appreciation of procedural, theoreti¬ 
cal, and empirical limitations of the tests. 

Standard 12.19 

The interpretation of test scores or patterns 
of test battery results should take cognizance 
of the many factors that may influence a 
particular testing outcome. Where appropri¬ 
ate, a description and analysis of the alterna¬ 
tive hypotheses or explanations that may 
have contributed to the pattern of results 
should be included in the report. 

Comment: Many factors (e.g., unusual testing 
conditions, motivation, educational level, 
employment status, lateral sensorimotor usage 
preferences, health, or disability status) may 
influence individual testing results. When 
such factors are known to introduce construct-irrelevant
variance in component test
scores, those factors should be considered 
during test score interpretations.


Standard 12.20 

Except for some judicial or governmental 
referrals, or in some employment testing sit¬ 
uations when the client is the employer, pro¬ 
fessionals should share test results and 
interpretations with the test taker. Such 
information should be expressed in language 
that the test taker, or when appropriate 
the test taker’s legal representative, can 
understand. 

Comment: For example, in rehabilitation set¬ 
tings, where clients typically are required to 
participate actively in intervention programs, 
sharing of such information, expressed in 
terms that can be understood readily by the 
client and family members, may facilitate the 
effectiveness of intervention. 
Background

This chapter concerns testing in formal educa¬ 
tional settings from kindergarten through post¬ 
graduate training. Results of tests administered 
to students are used to make judgments, for 
example, about the status, progress, or accom¬ 
plishments of individuals or groups. Tests that 
provide information about individual perform¬ 
ance are used to (a) evaluate a student's overall 
achievement and growth in a content domain, 
(b) diagnose student strengths and weaknesses 
in and across content domains, (c) plan educa¬ 
tional interventions and to design individual¬ 
ized instructional plans, (d) place students in 
appropriate educational programs, (e) select 
applicants into programs with limited enrollment,
and (f) certify individual achievement or
qualifications. Tests that provide information 
about the status, progress, or accomplishments 
of groups such as schools, school districts, or 
states are used (a) to judge and monitor the 
quality of educational programs for all or for 
particular subsets of individuals, and (b) to 
infer the success of policies and interventions 
that have been selected for evaluation. These
testing purposes are typically mandated by
institutions such as schools and colleges and 
by governing bodies of public and privately 
administered educational programs. 

In this chapter, three broad areas of edu¬ 
cational testing are considered that encompass 
one or more of the above purposes: (a) routine 
school, district, state, or other system-wide 
testing programs; (b) testing for selection in 
higher education; and (c) individualized and 
special needs testing. While the second and 
third areas refer to relatively specific purposes 
of testing, system-wide testing programs can 
encompass multiple individual and group pur¬ 
poses. For each of these areas, the chapter elab¬ 
orates on the specific purposes and domains 
encompassed and raises specific issues of technical
quality and fairness in testing that may
not be addressed or emphasized in the preced¬ 
ing chapters. This chapter does not explicitly 
address issues related to tests constructed and 
administered by teachers for their own class¬ 
room use or provided by publishers of instruc¬ 
tional materials. While many aspects of the 
Standards, particularly those in the areas of 
validity, reliability, test development, and fair¬ 
ness, are relevant to such tests, this document 
is not intended for tests used by teachers for 
their own classroom purposes. 

Issues in Educational Testing 

This chapter first considers some cross-cutting 
issues: the distinctions among types of tests, the 
design or use of tests to serve multiple pur¬ 
poses including the measurement of change, 
and the “stakes” associated with different pur¬ 
poses for testing in education. 

Distinctions Among Types of Tests and 
Assessments 

Tests used in educational settings range 
from tests consisting of traditional item formats 
such as multiple-choice items to performance 
assessments including scorable portfolios. Every 
test, regardless of its format, measures test-taker 
performance in a specified domain. Performance 
assessments, however, attempt to emulate the 
context or conditions in which the intended 
knowledge or skills are actually applied. As dis¬ 
cussed in chapter 3, they are diverse in nature 
and can be product-based as well as behavior- 
based. The execution of the tasks posed in these 
tests often involves relatively extended time 
periods, ranging from a few minutes to a class 
period or more to several hours or days. 
Examples of such performances might include 
solving problems using manipulable materials, 
making complex inferences after collecting 
information, or explaining orally or in writing 
the rationale for a particular course of government
action under given economic conditions.
The performance task may be undertaken by 
a single individual or a team of students. 
Performance assessments may require increased 
testing time to provide sufficient domain sam¬ 
pling for reasonable estimates of individual 
attainment and for making generalizations to 
the broader domain. Extended time periods, 
collaboration, and the use of ancillary materials 
pose great challenges to the standardization of 
administration and scoring of some perform¬ 
ance assessments. This is particularly true when 
test takers define their own tasks or when they 
select their own work products for evaluation. 
When this is the case, test takers need to be 
aware of the basis for scoring as well as the na¬ 
ture of the criteria that will be applied. Further, 
performance assessments often require com¬ 
plex procedures and training to increase the 
accuracy of judgments made by those evaluat¬ 
ing student performance (see chapter 3). 

An individual portfolio may be used as 
another type of performance assessment. 
Scorable portfolios are systematic collections of 
educational products typically collected over 
time and possibly amended over time. The 
particular purpose of the portfolio determines 
whether it will include representative products, 
the best work of the student, or indicators of 
progress. The purpose also dictates who will be 
responsible for compiling the contents of the
portfolio: the examiner, the student, or both
parties working together. The more standard¬
ized the contents and procedures of administra¬ 
tion, the easier it is to establish comparability of 
portfolio-based scores. Establishing comparabil¬ 
ity requires portfolios to be constructed accord¬ 
ing to test specifications and standards, and the 
development of objective procedures to judge 
their quality. The test specifications for portfo¬ 
lios may indicate that students are to make cer¬ 
tain decisions about the nature of the work to be 
included. For example, in constructing an art 
portfolio, students may select the media that 
best represent their work. Establishing comparability
also requires specifications regarding the
kinds of assistance the student may have received 
during portfolio preparation. It is particularly 
difficult to compare the performance of students 
whose portfolios may vary in content. All performance
assessments, including scorable portfolios,
are judged by the same standards of technical
quality as traditional tests of achievement.

Electronic media are often used both to 
present testing material and to record and score 
test takers’ responses. These tests may be administered
in schools, in special laboratory settings,
or in external testing centers. Examples include 
simple enhancements of text by audio-taped 
instructions to facilitate student understand¬ 
ing, computer-based tests traditionally given in 
paper-and-pencil format, computer-adaptive 
tests, and newer, interactive multimedia testing 
situations where attributes of performance 
assessments are supported by computer. Some 
computer-based tests also may have the capacity 
to capture aspects of students’ processes as they 
solve test items. They may, for example, monitor 
time spent on items, solutions tried and rejected, 
or editing sequences for texts. Electronic media 
also make it possible to provide test adminis¬ 
tration conditions designed to assist students 
with particular needs, such as those with dif¬ 
ferent language backgrounds, attention prob¬ 
lems, or physical disabilities. Computers can 
also help identify the contributions of individ¬ 
uals to a group task completed by a team or in 
geographically remote locations on a network. 

Computer-based tests are evaluated by the 
same technical quality standards as other tests 
administered through more traditional means. 
It is especially important that test takers be 
familiarized with the media of the test so that 
any unfamiliarity with computers or strategies 
does not lead to inferences based on construct-irrelevant
variance. Furthermore, it is important
to describe scoring algorithms, expert models 
upon which they may be based, and technical 
data supporting their use in any documenta¬ 
tion accompanying the testing system. It is 
important, however, to assure that the documentation
does not jeopardize the security of
the items that could adversely affect the validity
of score interpretations. Some computer-based
tests may also generate recommendations
for instructional practices based on test results. 
Describing the basis for these recommenda¬ 
tions assists the user in evaluating their appli¬ 
cability in a given situation. 

Multiple Purposes and Measuring Change 

Many tests are designed or used to serve 
multiple purposes in education. For example, a 
test may be used to monitor individual student 
achievement as well as to evaluate the quality 
of educational programs at the school or dis¬ 
trict level. As another example, a test may be 
used to evaluate an individual’s performance 
relative to the performance of one or more ref¬ 
erence populations as well as to evaluate the 
level of the individual’s competence in some 
defined domain (see chapters 3 and 4). The 
evidence needed for the technical quality of one 
purpose, however, will differ from the evidence 
needed for another purpose. Consequently, it 
is important to evaluate the evidence of techni¬ 
cal quality for each purpose of testing. 

Test results may be used to infer the growth 
or progress as well as the status of individuals 
or groups of students, such as when tests are 
expected to reveal the effects of instruction, 
of changes in educational policy, or of other 
interventions. In such cases, the test’s ability to 
detect change is essential. If differences in scores
are reported, the technical quality of the dif¬ 
ferences needs attention. More generally, 
whenever inferences about growth or progress 
are made, it is important to evaluate the validi¬ 
ty of those inferences. 
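
As one illustration of what attending to the technical quality of difference scores can involve, the following minimal sketch computes the standard error of a difference between two scores and checks whether an observed change exceeds it. It assumes that the measurement errors of the two scores are independent. The function names, the standard errors of measurement, and the 1.96 multiplier are hypothetical choices made for illustration and are not prescribed by this chapter.

```python
import math


def se_of_difference(sem_pre, sem_post):
    """Standard error of a difference score, assuming independent errors."""
    return math.sqrt(sem_pre ** 2 + sem_post ** 2)


def change_exceeds_error(pre, post, sem_pre, sem_post, z=1.96):
    """Return True when the observed change is larger than z standard errors.

    The multiplier z is an illustrative assumption; a testing program would
    need to justify its own criterion.
    """
    return abs(post - pre) > z * se_of_difference(sem_pre, sem_post)


if __name__ == "__main__":
    # Hypothetical scores and standard errors of measurement.
    print(se_of_difference(3.0, 4.0))
    print(change_exceeds_error(50, 58, 3.0, 4.0))
    print(change_exceeds_error(50, 62, 3.0, 4.0))
```

A check of this kind addresses only measurement error in the two scores; it does not by itself support an inference that instruction or policy caused the change.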

Stakes of Testing 

The importance of the results of testing 
programs for individuals, institutions, or groups 
is often referred to as the stakes of the testing 
program. At the individual level, when signifi¬ 
cant educational paths or choices of an individual 
are directly affected by test performance, such as
whether a student is promoted or retained at a
grade level, graduated, or admitted or placed 
into a desired program, the test use is said to 
have high stakes. A low-stakes test, on the other 
hand, is one administered for informational 
purposes or for highly tentative judgments such 
as when test results provide feedback to students, 
teachers, and parents on student progress during
an academic period. Testing programs for
institutions can have high stakes when aggre¬ 
gate performance of a sample or of the entire 
population of test takers is used to infer the 
quality of service provided, and decisions are 
made about institutional status, rewards, or 
sanctions based on test results. For example, 
the quality of reading curriculum and instruction
may be judged on the basis of test results
because test scores can indicate the rate of stu¬ 
dent progress or the levels of attainment reached 
by groups of students. Even when test results 
are reported in the aggregate and intended for 
a low-stakes purpose such as monitoring the 
educational system, the public release of data 
can raise the stakes for particular schools or 
districts. Judgments about program quality, 
personnel, and educational programs might 
be made and policy decisions might be affect¬ 
ed, even though the tests were not intended 
or designed for those purposes. 

The higher the stakes associated with a 
given test use, the more important it is that 
test-based inferences are supported with strong 
evidence of technical quality. In particular, 
when the stakes for an individual are high, and 
important decisions depend substantially on test 
performance, the test needs to exhibit higher 
standards of technical quality for its avowed 
purposes than might be expected of tests used 
for lower-stakes purposes (see chapters 1, 2, and 
7 for a more thorough discussion on validity, 
reliability, and bias in testing, respectively). 
Although it is never possible to achieve perfect 
accuracy in describing an individual’s perform¬ 
ance, efforts need to be made to minimize errors 
in estimating individual scores or in classifying 
individuals in pass/fail or admit/reject categories.
Further, enhancing validity for high-stakes
purposes, whether individual or institutional, 
typically entails collecting sound collateral 
information both to assist in understanding 
the factors that contributed to test results and
to provide corroborating evidence that supports 
inferences based on test results. These issues 
will be addressed more fully as they relate to 
the three areas of testing described below. 

School, District, State, or Other 
System-Wide Testing Programs 

As indicated previously, system-wide testing 
programs can span multiple purposes. At the 
individual level, tests are used for low-stakes 
purposes, such as monitoring and providing 
feedback on student progress, and for more 
high-stakes purposes, such as certifying stu¬ 
dents’ acquisition of particular knowledge and 
skills for promotion, placement into special 
instructional programs, or graduation. At the 
school, district, state, or other aggregate level, 
a common purpose of tests is to evaluate the 
progress made by groups of students or to
monitor the long-term effectiveness of the 
overall educational system. Educational test¬ 
ing programs may also permit comparisons 
among the performance of various groups of 
students in different programs or in diverse 
settings for the purpose of making an evalua¬ 
tion of those learning environments. Chapter 
15 provides a more thorough discussion on 
program evaluation. 

In these contexts, educational tests are 
designed to measure certain aspects of stu¬ 
dents’ knowledge and skills as reflected in cur¬ 
riculum goals and standards. There may be 
considerable variation in the breadth and 
depth of the knowledge and skills that are 
measured by such tests. Some educational 
tests focus on the test takers’ general ability or 
knowledge in a particular content area, such as 
their understanding of mathematics or science. 
Other tests focus on test takers’ specific knowledge
of a topic in detail, such as trigonometry.
Still others emphasize specific skills or procedures,
such as the ability to write persuasively
or to design, conduct, and interpret the results 
of a scientific experiment. Tests may address 
other cognitive aspects of test takers’ develop¬ 
ment, such as their ability to work with others 
to solve problems or their self-reported habits 
and attitudes, as well as noncognitive aspects, 
such as students’ ability to perform particular 
physical tasks. In most cases, valid interpreta¬ 
tion of the results requires that evidence of the 
fit between the test domain and the relevant 
curriculum goals or standards be ascertained. 

Testing programs may involve the use of 
tests designed to represent a set of general edu¬ 
cational standards as determined for instance 
by the state, district, or relevant educational 
professional organization. Such tests are con¬ 
ceptually similar to criterion-referenced tests, 
in that a set of content standards is developed 
that is intended to provide broad specifica¬ 
tions for student performance by delimiting 
the content and general skills to be measured. 
Subsequently, descriptive or empirical targets 
or levels of achievement are developed and 
referred to as performance standards. These 
performance standards are intended to define 
further the knowledge and skills required of 
students for each of the different categories
of proficiency. 

This type of testing may involve the devel¬ 
opment of a new test to assess the relevant 
content and skills or the selection of an existing
test that can be referenced to the standards.
Whether a test is designed or selected, valid 
interpretation of the results in light of the stan¬ 
dards entails assessment of the degree of fit 
between the test domain and contents and the
descriptive statements of standards or goals. 
This involves a process of mapping or referenc¬ 
ing the content and skills of the test to those of 
the standards to be sure that gaps or imbalances
do not occur. The curriculum goals or
standards may be sufficiently broad to encom¬ 
pass many different ways for students to 
demonstrate their status, accomplishments, or 
progress. Moreover, some goals or standards
may not lend themselves to conventional test 
formats. These are cases in which the test may 
result in construct underrepresentation, which
refers to the extent to which a test fails to cap¬ 
ture important aspects of what it is intended to 
measure. Chapter 1 provides a more thorough 
discussion of construct underrepresentation. 
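
The process of mapping test content to standards described above can be sketched in a few lines. The following is a minimal illustration, assuming each item has already been judged to address exactly one standard. The function name, the item labels, and the standard codes are hypothetical. Real alignment studies typically rely on expert judgment and allow an item to address more than one standard.

```python
from collections import Counter


def coverage_report(item_to_standard, standards):
    """Summarize how test items are distributed across content standards.

    Returns (gaps, shares): gaps lists standards with no items, and shares
    gives the proportion of items mapped to each standard.
    """
    counts = Counter(item_to_standard.values())
    total = len(item_to_standard)
    gaps = [s for s in standards if counts[s] == 0]
    shares = {s: (counts[s] / total if total else 0.0) for s in standards}
    return gaps, shares


if __name__ == "__main__":
    # Hypothetical mapping of four items to three standards.
    mapping = {"q1": "S1", "q2": "S1", "q3": "S2", "q4": "S1"}
    gaps, shares = coverage_report(mapping, ["S1", "S2", "S3"])
    print("Standards with no items:", gaps)
    print("Share of items per standard:", shares)
```

A standard that appears in the list of gaps, or that carries a very small share of items, signals possible construct underrepresentation that interpretations of the results would need to acknowledge.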

In these cases, interpretation of test results in 
light of goals or standards is enhanced by an 
understanding of what is not covered as well 
as what is covered by the test. Sometimes,
additional commercial or locally developed 
tests are administered within a particular juris¬ 
diction, and attempts are made to link these 
existing tests to the proficiency levels reported 
for the new test or to provide other evidence 
of comparability. It is important to provide 
logical and empirical validity evidence of any 
reported links. For example, evidence can be 
collected to determine the extent to which the 
existing test can provide information about the 
proficiency of individual students and groups 
of students in the particular content areas and 
skills addressed by the standards. The validity 
of such links is problematic to the extent that 
the tests measure different content (see chapter 
4 for a discussion on issues in equating and 
linking tests). 

When inferences are to be drawn about the 
performance of groups of students, practical 
considerations and the format of the test (e.g., 
performance assessment) often dictate that dif¬ 
ferent subgroups of students within each unit 
respond to different sets of tasks or items, a pro¬ 
cedure referred to as matrix sampling. This 
matrix sampling approach allows for a test to 
better represent the breadth of the target domain 
without increasing the testing time for each test 
taker. Group-level results are most useful when
testing programs and student populations 
remain sufficiently stable to provide informa¬ 
tion about trends over time. When a testing 
program is designed for group-level reporting 
and employs matrix sampling, reporting indi¬ 
vidual scores generally is not appropriate. 
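
The matrix sampling procedure described above can be sketched as follows. This is a minimal illustration under simple assumptions: each student receives exactly one block of items, blocks are rotated so that they are used about equally often, and responses are scored 0 or 1. The function names, block labels, and data layout are hypothetical and do not describe any particular testing program.

```python
import random


def assign_matrix_sample(student_ids, item_blocks, seed=0):
    """Assign each student one item block, rotating through the blocks."""
    rng = random.Random(seed)
    shuffled = list(student_ids)
    rng.shuffle(shuffled)  # randomize which students receive which block
    block_names = sorted(item_blocks)
    return {
        student: block_names[i % len(block_names)]
        for i, student in enumerate(shuffled)
    }


def group_proportion_correct(assignment, item_blocks, responses):
    """Estimate the group-level proportion correct for every item.

    Each item's estimate uses only the students who were administered it,
    so no individual student has a score on the full domain.
    """
    totals = {}
    for student, block in assignment.items():
        for item in item_blocks[block]:
            correct, seen = totals.get(item, (0, 0))
            totals[item] = (correct + responses[student][item], seen + 1)
    return {item: correct / seen for item, (correct, seen) in totals.items()}


if __name__ == "__main__":
    students = ["s1", "s2", "s3", "s4", "s5", "s6"]
    blocks = {"A": ["i1", "i2"], "B": ["i3", "i4"], "C": ["i5", "i6"]}
    assignment = assign_matrix_sample(students, blocks, seed=1)
    # Hypothetical responses: every administered item answered correctly.
    responses = {s: {i: 1 for i in blocks[assignment[s]]} for s in students}
    print(assignment)
    print(group_proportion_correct(assignment, blocks, responses))
```

Because each student answers only a fraction of the items, the sketch yields estimates for the group across all six items while keeping each student's testing time at two items, which is also why individual scores from such a design are generally not appropriate to report.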


When interpreting and using scores about 
individuals or groups of students, considera¬ 
tion of relevant collateral information can 
enhance the validity of the interpretation, by 
providing corroborating evidence or evidence 
that helps explain student performance. Test 
results can be influenced by multiple factors, 
including institutional and individual factors 
such as the quality of education provided, 
students’ exposure to education (e.g., through 
regular school attendance), and students’ 
motivation to perform well on the test.

As the stakes of testing increase for indi¬ 
vidual students, the importance of considering 
additional evidence to document the validity 
of score interpretations and the fairness in test¬ 
ing increases accordingly. The validity of indi¬ 
vidual interpretations can be enhanced by 
taking into account other relevant information 
about individual students before making 
important decisions. It is important to consider 
the soundness and relevance of any collateral 
information or evidence used in conjunction 
with test scores for making educational decisions. 
Further, fairness in testing can be enhanced 
through careful consideration of conditions that 
affect students’ opportunities to demonstrate 
their capabilities. For example, when tests are 
used for promotion and graduation, the fairness 
of individual interpretations can be enhanced 
by (a) providing students with multiple oppor¬ 
tunities to demonstrate their capabilities 
through repeated testing with alternate forms 
or through other construct-equivalent means, 
(b) ensuring students have had adequate notice of 
skills and content to be tested along with other 
appropriate test preparation material, (c) pro¬ 
viding students with curriculum and instruc¬ 
tion that affords them the opportunity to learn 
the content and skills that are tested, and (d) 
providing students with equal access to any 
specific preparation for test taking (e.g., test¬ 
taking strategies). Chapter 7 provides a more 
thorough discussion on fairness in testing. 

Collateral information can also enhance 
interpretation and decisions at the institutional
level. For instance, changes in test scores from
year to year may not only reflect changes in 
the capabilities of students but also changes 
in the student population (e.g., successive 
cohorts of students). Differences in scores 
across ethnic groups may be confounded with 
differences in socioeconomic status of the 
communities in which they live and, hence, 
the educational resources to which students 
have access. Differences in scores from school 
to school may similarly reflect differences in 
resources and activities such as the qualifica¬ 
tion of teachers or the number of advanced 
course offerings. While local empirical evi¬ 
dence of the influence of these factors may not 
be readily available, consideration of evidence 
from similar contexts available in published 
literature can enhance the quality of the inter¬ 
pretation and use of current results. 

Because public participation is an integral 
part of educational governance, policymakers, 
professional educators, and members of the 
public are concerned with the nature of educa¬ 
tional tests, the domains that the tests are 
intended to measure, the choices in test design, 
adoption, and implementation, and the issues 
associated with valid interpretation and uses 
of test results. It is important that test results 
be reported in a way that all stakeholders can
understand, that enables sound interpretations, 
and that decreases the chance of misinterpreta¬ 
tions and inappropriate decisions. 

Large-scale testing is increasingly viewed 
as a tool of educational policy. From this per¬ 
spective, tests used for program evaluation, 
such as some state tests that are aligned to the 
state’s own curriculum standards, are not used 
solely as measures of school outcomes (see 
chapter 15 for a more thorough discussion on 
the use of tests for program evaluation). They 
are also viewed as a means to influence cur¬ 
riculum and instruction, to hold teachers and 
school administrators accountable, to increase 
student motivation, and to communicate per¬ 
formance expectations to students, to teachers, 
and to the public. If such goals are set forth as 
part of the rationale for a testing program, the
validity of the testing program needs to be 
examined with respect to these goals. Beyond
any intended policy goals, it is important to 
consider potential unintended effects that 
may result from large-scale testing programs. 
Concerns have been raised, for instance, about 
narrowing the curriculum to focus only on 
the objectives tested, restricting the range of 
instructional approaches to correspond to 
the testing format, increasing the number of 
dropouts among students who do not pass the 
test, and encouraging other instructional or 
administrative practices that may raise test 
scores without affecting the quality of educa¬ 
tion. It is important for those who mandate 
tests to consider and monitor their conse¬ 
quences and to identify and minimize the 
potential of negative consequences. 

Selection in Higher Education 

It is widely recognized that tests are used in the 
selection of applicants for admission to partic¬ 
ular educational programs, especially admis¬ 
sions to colleges, universities, and professional 
schools. Selection criteria may vary within 
an institution by academic specialization. In 
addition to scores from selection tests, many 
other sources of evidence are used in making 
selection decisions, including past academic 
records, transcripts, and grade-point average 
or rank in class. Scores on tests used to certify 
students for high school graduation may be 
used in the college admissions process. Other 
measures used by some institutions are samples 
of previous work by students, lists of academic 
and service accomplishments, letters of recommendation,
and student-composed statements
evaluated for the appropriateness of
the goals and experience of the student or 
for writing proficiency. 

Two major points may be made about the 
role of tests in the admissions process. First,
scores are often used in combination with other
sources of information. Some of these supplemental
sources of evidence may not be reliably
assessed or may lack comparability from appli¬ 
cant to applicant. For this reason, it is impor¬ 
tant that studies be conducted examining the 
relationships among test scores, data from 
other sources of information, and college performance.
Second, the public and policymakers
are to be cautious about the widespread
use of reports of college admission test scores 
to infer the effectiveness of middle school and 
high school as well as to compare schools or 
states. Admissions tests, whether they are 
intended to measure achievement or ability, 
are not directly linked to a particular instruc¬ 
tional curriculum and, therefore, are not 
appropriate for detecting changes in middle 
school or high school performance. Because 
of differential motivational factors and other 
demographic variables found across and within 
pre-collegiate programs, self-selection precludes 
general comparisons of test scores across demo¬ 
graphic groups. Therefore, self-selection also 
precludes comparisons of test scores among 
the full ranges of pre-collegiate programs. 

Individualized and Special Needs
Testing 

Individually administered tests are used by 
school psychologists and other professionals 
in schools and other related settings to 
facilitate the learning and development of 
students who may have special educational 
needs (see chapter 12). Some of these services 
are reserved for those students who have gift¬ 
ed capabilities as well as for those students 
who may have relatively minor academic difficulties
(e.g., those requiring remedial
reading). Other services are reserved for
students who display behavioral, emotional, 
physical, and/or more severe learning diffi¬ 
culties. Services may be provided to students 
who are in regular classroom settings as well 
as to students who need more specialized 
instruction outside of the regular classroom. 
The ultimate purpose of these services is to
assure all students are placed into appropriate
educational programs. 

Individually administered tests can serve 
a number of purposes, including screening, 
diagnostic classification, intervention planning, 
and program evaluation. For screening purpos¬ 
es, tests are administered to identify students 
who might differ significantly from their peers 
and might require additional assessment. For 
example, screening tests may be used to identi¬ 
fy young children who show signs of develop¬ 
mental disorders and to signal the need for 
further evaluation. For diagnostic purposes, 
tests may be used to clarify the types and 
extent of an individual’s difficulties or prob¬ 
lems in light of well-established criteria. Test 
results provide an important basis for deter¬ 
mining whether the student meets eligibility 
requirements for special education and other 
related services and, if so, the specific types 
of services that the student needs. Test results 
may be used for intervention purposes in 
establishing behavior and learning goals and 
objectives for the student, planning instructional
strategies that should be used, and specifying
the appropriate setting in which the
special services are to be delivered (e.g., regular 
classroom, resource room, full-time special 
class, etc.). Subsequent to the student’s place¬ 
ment in special services, tests may be adminis¬ 
tered to monitor the progress of the student 
toward prescribed learning goals and objec¬ 
tives. Test results may be used also to evaluate 
the effectiveness of instruction to determine
whether the special services need to be continued,
modified, or discontinued.

Many types of tests are used in individual¬ 
ized and special needs testing. These include 
tests of cognitive abilities, academic achieve¬ 
ment, learning processes, visual and auditory 
memory, speech and language, vision and 
hearing, and behavior and personality. These 
tests are used typically in conjunction with 
other assessment methods such as interviews, 
behavioral observation, and review of records. 
Each of these may provide useful data for making
appropriate decisions about a student. In
addition, procedures that aim to link assess¬ 
ment closely to intervention may be used, 
including behavioral assessments, assessments 
of learning environments, curriculum-based 
tests, and portfolios. Regardless of the qualities 
being assessed and types of data collection 
methods employed, assessment data used in 
making special education decisions are evaluat¬ 
ed in terms of validity, reliability, and relevance 
to the specific needs of the students. They 
must also be judged in terms of their useful¬ 
ness for designing appropriate educational pro¬ 
grams for students who have special needs. 

The amount and complexity of the assess¬ 
ment data required for making various deci¬ 
sions about a student will vary depending on 
the purpose of testing, the needs of the stu¬ 
dent, and other information already available 
about the student (e.g., current scores on a rel¬ 
evant test may be on file for some students but 
not for others). In general, testing for screening 
and program evaluation purposes typically 
involves the use of one or two tests rather than 
comprehensive test batteries. For determining 
eligibility and designing intervention, testing 
and assessment is more comprehensive and
may involve multiple procedures and sources. 
Moreover, in-depth analyses and interpretation 
of the data are necessary. 

In special education, tests are selected, 
administered, and interpreted by school psy¬ 
chologists, school counselors, regular and spe¬ 
cial educators, speech pathologists, and 
physical therapists, among other professionals. 
The validity of inferences will be enhanced if 
test users possess adequate knowledge of the 
principles of measurement and evaluation. 
However, this diverse group of test users may 
differ in their levels of technical expertise in 
measurement and degree of professional train¬ 
ing in assessment procedures. It is important 
that professional evaluators administer and 
interpret only those tests with which they
have training and competence, in order to
prevent misuse of tests. 

State and federal law generally requires 
that students who are referred for possible 
special education services be screened for eli¬ 
gibility. The screening or initial assessment 
may in turn call for a more comprehensive 
evaluation. But the large numbers of students 
to be tested, the high cost of special educa¬ 
tion programs, and the limits of time create 
pressures on special education assessment 
practices. Assessment usually must be com¬ 
pleted within a specific number of working 
days after referral, and, in most instances, the 
school district is responsible for funding spe¬ 
cial services recommended by the child study 
team. Occasionally, administrators might be 
inclined to use less expensive, less time-con¬ 
suming, or more readily available testing pro¬ 
cedures than a professional evaluator believes 
are warranted. An example would be the 
inappropriate use of available, but less ade¬ 
quately trained, staff to evaluate students. 
There also might be pressures to minimize 
or overlook problems that require expensive 
services. These conditions are likely to 
adversely affect the validity of the interpretation
of test results. Adherence to professional
standards governing test use in conducting 
special education assessments is important, in 
the face of pressures to use more expedient 
procedures. The responsible use of tests by 
school personnel can improve the opportunities
for promoting the development and
learning of all children. 

STANDARDS


Standard 13.1 

When educational testing programs are 
mandated by school, district, state, or 
other authorities, the ways in which test 
results are intended to be used should be 
clearly described. It is the responsibility 
of those who mandate the use of tests to 
monitor their impact and to identify and 
minimize potential negative consequences. 
Consequences resulting from the uses of 
the test, both intended and unintended, 
should also be examined by the test user. 

Comment: Mandated testing programs are 
often justified in terms of their potential 
benefits for teaching and learning. Concerns 
have been raised about the potential negative 
impact of mandated testing programs, par¬ 
ticularly when they result directly in impor¬ 
tant decisions for individuals or institutions. 
Frequent concerns include narrowing the 
curriculum to focus only on the objectives 
tested, increasing the number of dropouts 
among students who do not pass the test, 
or encouraging other instructional or 
administrative practices simply designed 
to raise test scores rather than to affect 
the quality of education. 

Standard 13.2 

In educational settings, when a test is 
designed or used to serve multiple purpos¬ 
es, evidence of the test’s technical quality 
should be provided for each purpose. 

Comment: In educational testing, it has 
become common practice to use the same
test for multiple purposes (e.g., monitoring 
achievement of individual students, provid¬ 
ing information to assist in instructional 
planning for individuals or groups of stu¬ 
dents, evaluating schools or districts). No 
test will serve all purposes equally well. 
Choices in test development and evaluation 
that enhance validity for one purpose may
diminish validity for other purposes.
Different purposes require somewhat dif¬ 
ferent kinds of technical evidence, and 
appropriate evidence of technical quality for 
each purpose should be provided by the test 
developer. If the test user wishes to use the 
test for a purpose not supported by the 
available evidence, it is incumbent on the 
user to provide the necessary additional 
evidence (see chapter 1).

Standard 13.3 

When a test is used as an indicator of 
achievement in an instructional domain 
or with respect to specified curriculum 
standards, evidence of the extent to which 
the test samples the range of knowledge 
and elicits the processes reflected in the 
target domain should be provided. Both 
tested and target domains should be 
described in sufficient detail so their rela¬ 
tionship can be evaluated. The analyses 
should make explicit those aspects of the 
target domain that the test represents as well 
as those aspects that it fails to represent. 

Comment: Increasingly, tests are being devel¬ 
oped to monitor progress of individuals and 
groups toward local, state, or professional 
curriculum standards. Rarely can a single 
test cover the full range of performances 
reflected in the curriculum standards. To 
assure appropriate interpretations of test 
scores as indicators of performance on these 
standards, it is essential to document and 
evaluate both the relevance of the test to the 
standards and the extent to which the test 
represents the standards. When existing tests 
are selected by a school, district, or state to 
represent local curricula, it is incumbent on 
the user to provide the necessary evidence of 
the congruency of the curriculum domain 
and the test content. Further, conducting 
studies of the cognitive strategies and skills 
employed by test takers or studies of the 
relationships between test scores and other
performance indicators relevant to the broad¬ 
er domain enables evaluation of the extent to 
which generalizations to the broader domain 
are supported. This information should be 
made available to all those who use the test 
and interpret the test scores. 

Standard 13.4 

Local norms should be developed when 
necessary to support test users’ intended 
interpretations. 

Comment: Comparison of examinees' scores 
to local as well as more broadly representative 
norm groups can be informative. Thus, sam¬ 
ple size permitting, local norms are often use¬ 
ful in conjunction with published norms, 
especially if the local population differs 
markedly from the population on which published
norms are based. In some cases, local
norms may be used exclusively. 
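
As a rough illustration of the comparison described in this comment, the sketch below computes a percentile rank for one score against two norm groups. The function name percentile_rank, the mid-rank treatment of ties, and every score value are assumptions invented for illustration; none of them comes from the Standards or from a published norm table.

```python
def percentile_rank(score, norm_scores):
    """Percent of the norm group scoring below `score`, counting half of any ties."""
    if not norm_scores:
        raise ValueError("norm group is empty")
    below = sum(1 for s in norm_scores if s < score)
    ties = sum(1 for s in norm_scores if s == score)
    return 100.0 * (below + 0.5 * ties) / len(norm_scores)


# Hypothetical norm samples (invented values, far smaller than real norm groups).
published_norms = [35, 40, 45, 50, 55, 60, 65, 70, 75, 80]
local_norms = [50, 55, 60, 62, 65, 68, 70, 75, 80, 85]

score = 65
print("published:", percentile_rank(score, published_norms))
print("local:", percentile_rank(score, local_norms))
```

The same raw score lands at a different percentile rank in each group, which is the situation in which reporting both sets of norms can be informative. A real local norm group would need a far larger sample than this toy example uses.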

Standard 13.5 

When test results substantially contribute to 
making decisions about student promotion 
or graduation, there should be evidence that 
the test adequately covers only the specific 
or generalized content and skills that stu¬ 
dents have had an opportunity to learn. 

Comment: Students, parents, and educational
staff should be informed of the domains on
which the students will be tested, the nature
of the item types, and the standards for mastery.
Reasonable efforts should be made to
document the provision of instruction on
tested content and skills, even though it may
not be possible or feasible to determine the
specific content of instruction for every student.
Chapter 7 provides a more thorough
discussion of the difficulties that arise with 
this conception of fairness in testing. 


Standard 13.6 

Students who must demonstrate mastery 
of certain skills or knowledge before being 
promoted or granted a diploma should have 
a reasonable number of opportunities to suc¬ 
ceed on equivalent forms of the test or be 
provided with construct-equivalent testing 
alternatives of equal difficulty to demon¬ 
strate the skills or knowledge. In most cir¬ 
cumstances, when students are provided 
with multiple opportunities to demonstrate 
mastery, the time interval between the 
opportunities should allow for students to 
have the opportunity to obtain the relevant 
instructional experiences. 

Comment: The number of opportunities and 
time between each testing opportunity will 
vary with the specific circumstances of the 
setting. Further, some students may benefit 
from a different testing approach to demonstrate
their achievement. Care must be taken
that evidence of construct equivalence of 
alternative approaches is provided as well as 
the equivalence of cut scores defining pass¬ 
ing expectations. 

Standard 13.7 

In educational settings, a decision or charac¬ 
terization that will have major impact on a 
student should not be made on the basis of 
a single test score. Other relevant informa¬ 
tion should be taken into account if it will 
enhance the overall validity of the decision. 

Comment: As an example, when the purpose
of testing is to identify individuals with spe¬ 
cial needs, including students who would 
benefit from gifted and talented programs, 
a screening for eligibility or an initial assessment
should be conducted. The screening or
initial assessment may in turn call for more 
comprehensive evaluation. The comprehen¬ 
sive assessment should involve the use of 
multiple measures, and data should be collected
from multiple sources. Any assessment
data used in making decisions are evaluated 
in terms of validity, reliability, and relevance 
to the specific needs of the students. It is 
important that in addition to test scores, 
other relevant information (e.g., school 
record, classroom observation, parent report) 
is taken into account by the professionals 
making the decision. 

Standard 13.8 

When an individual student’s scores from 
different tests are compared, any educational 
decision based on this comparison should 
take into account the extent of overlap 
between the two constructs and the reliabili¬ 
ty or standard error of the difference score. 

Comment: When difference scores between 
two tests are used to aid in making educa¬ 
tional decisions, it is important that the two 
tests are standardized and, if appropriate,
normed on the same population at about the 
same time. In addition, the reliability and 
standard error of the difference scores 
between the two tests are affected by the 
relationship between the constructs measured
by the tests as well as the standard
errors of measurement of the scores of the 
two tests. In the case of comparing ability 
with achievement test scores, the overlapping 
nature of the two constructs may render the 
reliability of the difference scores lower than 
test users normally would assume. If the abili¬ 
ty and/or achievement tests involve a signifi¬ 
cant amount of measurement error, this will 
also reduce the confidence one may place on
the difference scores. All these factors affect 
the reliability of difference scores between 
tests and should be considered by professional 
evaluators in using difference scores as a basis 
for making important decisions about a stu¬ 
dent. This standard is also relevant when 
comparing scores from different components
of the same test such as multiple aptitude test
batteries and selection tests. 
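
The quantities named in this standard can be made concrete with a small sketch based on classical test theory. It assumes that both tests are reported on the same scale with equal standard deviations and that their measurement errors are uncorrelated; the function names and the numbers are illustrative assumptions, not values taken from the Standards.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement for one test."""
    return sd * math.sqrt(1.0 - reliability)


def se_difference(sd_x, rel_x, sd_y, rel_y):
    """Standard error of a difference score, assuming uncorrelated errors."""
    return math.sqrt(sem(sd_x, rel_x) ** 2 + sem(sd_y, rel_y) ** 2)


def difference_reliability(rel_x, rel_y, r_xy):
    """Reliability of a difference score, assuming equal standard deviations."""
    if r_xy >= 1.0:
        raise ValueError("correlation between the tests must be below 1")
    return ((rel_x + rel_y) / 2.0 - r_xy) / (1.0 - r_xy)


# Hypothetical ability and achievement tests, each with SD 15 and reliability .90.
print(se_difference(15, 0.90, 15, 0.90))
for r_xy in (0.60, 0.80):
    print(r_xy, difference_reliability(0.90, 0.90, r_xy))
```

With these invented inputs, raising the correlation between the two tests from .60 to .80 lowers the reliability of the difference score from about .75 to about .50, even though each test alone has a reliability of .90. This is the effect of overlapping constructs that the comment describes.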

Standard 13.9 

When test scores are intended to be used as 
part of the process for making decisions for 
educational placement, promotion, or 
implementation of prescribed educational 
plans, empirical evidence documenting the 
relationship among particular test scores, the 
instructional programs, and desired student 
outcomes should be provided. When ade¬ 
quate empirical evidence is not available, 
users should be cautioned to weigh the test 
results accordingly in light of other relevant 
information about the student. 

Comment: The validity of test scores for 
placement or promotion decisions rests, in 
part, upon evidence about whether students, 
in fact, benefit from the differential instruc¬ 
tion. Similarly, in special education, when 
test scores are used in the development of 
specific educational objectives and instruc¬ 
tional strategies, evidence is needed to show 
that the prescribed instruction enhances students’
learning. When there is limited evidence
about the relationship among test
results, instructional plans, and student 
achievement outcomes, test developers and 
users should stress the tentative nature of the 
test-based recommendations and encourage 
teachers and other decision makers to consider 
the usefulness of test scores in light of other 
relevant information about the students. 

Standard 13.10 

Those responsible for educational testing pro¬ 
grams should ensure that the individuals who 
administer and score the test(s) are proficient
in the appropriate test administration proce¬ 
dures and scoring procedures and that they 
understand the importance of adhering to the 
directions provided by the test developer. 

Standard 13.11

In educational settings, test users should 
ensure that any test preparation activities 
and materials provided to students will not 
adversely affect the validity of test score 
inferences. 

Comment: In most educational testing 
contexts, the goal is to use a sample of test 
items to make inferences to a broader 
domain. When inappropriate test prepara¬ 
tion activities occur, such as teaching items 
that are equivalent to those on the test, the 
validity of test score inferences is adversely 
affected. The appropriateness of test prepa¬ 
ration activities and materials can be evalu¬ 
ated, for example, by determining the 
extent to which they reflect the specific test 
items and the extent to which test scores are 
artificially raised without actually increasing 
students’ level of achievement. 

Standard 13.12 

In educational settings, those who super¬ 
vise others in test selection, administration, 
and interpretation should have received 
education and training in testing necessary 
to ensure familiarity with the evidence for 
validity and reliability for tests used in the 
educational setting and to be prepared to 
articulate or to ensure that others articu¬ 
late a logical explanation of the relation¬ 
ship among the tests used, the purposes 
they serve, and the interpretations of the 
test scores. 

Standard 13.13 

Those responsible for educational testing 
programs should ensure that the individuals 
who interpret the test results to make deci¬ 
sions within the school context are qualified 
to do so or are assisted by and consult
with persons who are so qualified. 


Comment: When testing programs are used 
as a strategy for guiding instruction, teach¬ 
ers expected to make inferences about 
instructional needs may need assistance in 
interpreting test results for this purpose. If 
the tests are normed locally, statewide, or 
nationally, teachers and administrators need 
to be proficient in interpreting the norm- 
referenced test scores. 

The interpretation of some test scores 
is sufficiently complex to require that the 
user have relevant psychological training 
and experience or be assisted by and consult 
with persons who have such training and 
experience. Examples of such tests include 
individually administered intelligence tests, 
personality inventories, projective techniques, 
and neuropsychological tests. 

Standard 13.14 

In educational settings, score reports 
should be accompanied by a clear state¬ 
ment of the degree of measurement error 
associated with each score or classification 
level and information on how to interpret 
the scores. 

Comment: This information should be com¬ 
municated in a way that is accessible to per¬ 
sons receiving the score report. For instance, 
the degree of uncertainty might be indicated 
by a likely range of scores or by the proba¬ 
bility of misclassification. 
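
One way to express a "likely range of scores" is a band built from the standard error of measurement. The sketch below is a minimal version of that idea; centering the band on the observed score is a common simplification, and the function name, the default multiplier of 1.96, and the example values are assumptions made for illustration.

```python
import math


def score_band(observed, sd, reliability, z=1.96):
    """Return (low, high) as observed score plus or minus z standard errors of measurement."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    sem = sd * math.sqrt(1.0 - reliability)
    return observed - z * sem, observed + z * sem


# Hypothetical scale: SD of 15 and reliability of .91, so the SEM is 4.5 points.
low, high = score_band(100, 15, 0.91, z=1.0)
print(f"Reported score 100, likely range {low:.1f} to {high:.1f}")
```

A score report could print such a range next to each score, in plain language, instead of quoting a reliability coefficient that many readers would not know how to interpret.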

Standard 13.15 

In educational settings, reports of group 
differences in test scores should be accom¬ 
panied by relevant contextual information, 
where possible, to enable meaningful 
interpretation of these differences. Where 
appropriate contextual information is not 
available, users should be cautioned 
against misinterpretation. 

Comment: Observed differences in test scores
between groups (e.g., classified by gender, race/ 
ethnicity, school/district, geographical region) 
can be influenced, for example, by differences 
in course-taking patterns, in curriculum, in 
teacher’s qualifications, or in parental educa¬ 
tional level. Differences in performance of 
cohorts of students across time may be influ¬ 
enced by changes in the population of students 
tested or changes in learning opportunities for 
students. Users should be advised to consider 
the appropriate contextual information and 
cautioned against misinterpretation. 

Standard 13.16 

In educational settings, whenever a test 
score is reported, the date of test adminis¬ 
tration should be reported. This informa¬ 
tion and the age of any norms used for 
interpretation should be considered by test 
users in making inferences. 

Comment: When a test score is used for a 
particular purpose, the date of the test score 
should be taken into consideration in deter¬ 
mining its worth or appropriateness for mak¬ 
ing inferences about a student. Depending 
on the particular domain measured, the 
validity of score inferences may be question¬ 
able as time progresses. For instance, a read¬ 
ing score from a test administered 6 months 
ago to an elementary school-aged student 
may no longer reflect the student’s current 
reading level. Thus, a test score should not 
be used if it has been determined that undue 
time has passed since the time of data collec¬ 
tion and that the score no longer can be con¬ 
sidered a valid indicator of a student’s current 
level of proficiency. 

Standard 13.17 

When change or gain scores are used, such 
scores should be defined and their technical 
qualities should be reported. 


Comment: The use of change or gain scores 
presumes the same test or equivalent forms 
of the test were used and that the test has 
(or the forms have) not been materially 
altered between administrations. The stan¬ 
dard error of the difference between scores 
on the pretest and posttest, the regression of 
posttest scores on pretest scores, or relevant 
data from other reliable methods for examin¬ 
ing change, such as those based on structural 
equation modeling, should be reported. 
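
The two simpler statistics named in this comment can be sketched briefly. The code assumes paired pretest and posttest scores from the same or equivalent forms, known standard errors of measurement, and independent errors across the two administrations; the function names and data are invented for illustration.

```python
import math


def gain_scores(pre, post):
    """Simple gain: posttest minus pretest for each examinee."""
    if len(pre) != len(post):
        raise ValueError("pretest and posttest lists must be the same length")
    return [b - a for a, b in zip(pre, post)]


def se_of_gain(sem_pre, sem_post):
    """Standard error of the pretest-posttest difference (independent errors assumed)."""
    return math.sqrt(sem_pre ** 2 + sem_post ** 2)


def regress_post_on_pre(pre, post):
    """Least-squares slope and intercept for predicting posttest from pretest."""
    n = len(pre)
    mean_pre = sum(pre) / n
    mean_post = sum(post) / n
    sxx = sum((x - mean_pre) ** 2 for x in pre)
    sxy = sum((x - mean_pre) * (y - mean_post) for x, y in zip(pre, post))
    if sxx == 0:
        raise ValueError("pretest scores have no variance")
    slope = sxy / sxx
    return slope, mean_post - slope * mean_pre


# Hypothetical scores for four students.
pre = [10, 20, 30, 40]
post = [15, 27, 33, 45]
print(gain_scores(pre, post))
print(se_of_gain(3.0, 4.0))
print(regress_post_on_pre(pre, post))
```

Comparing each gain with the standard error of the gain gives a sense of whether an observed change exceeds what measurement error alone could produce. The regression line offers an alternative way to define change, as the residual from the predicted posttest score.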

Standard 13.18 

Documentation of design, models, scoring 
algorithms, and methods for scoring and 
classifying should be provided for tests 
administered and scored using multimedia 
or computers. Construct-irrelevant variance 
pertinent to computer-based testing and 
the use of other media in testing, such as 
the test taker’s familiarity with technology 
and the test format, should be addressed in 
their design and use. 

Comment: It is important to assure that the
documentation does not jeopardize the security
of the items in a way that could adversely affect
the validity of score interpretations. Computer
and multimedia testing need to be held to
the same requirements of technical quality
as other tests.

Standard 13.19 

In educational settings, when average or 
summary scores for groups of students are 
reported, they should be supplemented 
with additional information about the 
sample size and shape or dispersion of 
score distributions. 

Comment: Score reports should be designed 
to communicate clearly and effectively to 
their intended audiences. In most cases, 
reports that go beyond average score compar¬ 
isons are helpful in furthering thoughtful use 
and interpretation of test scores. Depending
on the intended purpose and audience of the 
score report, additional information might 
take the form of standard deviations or other 
common measures of score variability, or of 
selected percentile points for each distribu¬ 
tion. Alternatively, benchmark score levels 
might be established and then, for each group 
or region, the proportions of test takers 
attaining each specified level could be 
reported. Such benchmarks might be defined,
for example, as selected percentiles of the 
pooled distribution for all groups or regions. 
Other distributional summaries or reporting
formats may also be useful. The goal of more 
detailed reporting must be balanced against 
goals of clarity and conciseness in commu¬ 
nicating test scores. 
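
The supplementary information listed in this comment can be gathered in a few lines. The sketch below uses Python's standard statistics module; the function name summarize_group, the choice of quartiles as the percentile points, and the scores and benchmark levels are illustrative assumptions.

```python
import statistics


def summarize_group(scores, benchmarks):
    """Return sample size, mean, spread, quartiles, and benchmark proportions."""
    if len(scores) < 2:
        raise ValueError("need at least two scores to estimate dispersion")
    n = len(scores)
    q1, median, q3 = statistics.quantiles(scores, n=4)
    return {
        "n": n,
        "mean": statistics.mean(scores),
        "sd": statistics.stdev(scores),  # sample standard deviation
        "quartiles": (q1, median, q3),
        # Proportion of test takers attaining each benchmark level.
        "at_or_above": {b: sum(s >= b for s in scores) / n for b in benchmarks},
    }


# Hypothetical scores for one group and two invented benchmark levels.
group = [40, 50, 55, 60, 65, 70, 75, 80, 90]
for key, value in summarize_group(group, benchmarks=[60, 80]).items():
    print(key, value)
```

Reporting the sample size and spread alongside the mean helps readers judge whether a difference between two group averages is large relative to the variation within each group.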

Background

Employment testing is carried out by organi¬ 
zations for purposes of employee selection, 
promotion, or placement. Selection generally 
refers to decisions about which individuals will 
enter the organization; placement refers to 
decisions as to how to assign individuals to 
positions within the work force; and promotion 
refers to decisions about which individuals with¬ 
in the organization will advance. What all three 
have in common is a focus on the prediction of 
future job behaviors, with the goal of influenc¬ 
ing organizational outcomes such as efficiency, 
growth, productivity, and employee motivation 
and satisfaction. 

Testing used in the processes of licensure 
and certification, which will here generically 
be called credentialing, focuses on the applicant’s
current skill or competency in a specified
domain. In many occupations, individuals
must be licensed by governmental agencies in 
order to engage in the particular occupation. 
In other occupations, professional societies or 
other organizations assume responsibility for 
credentialing. Although licensure is typically 
a credential for entry into an occupation, cre¬ 
dentialing programs may exist at varying lev¬ 
els, from novice to expert in a given field. 
Certification is usually sought voluntarily, 
although occupations differ in the degree to 
which obtaining certification influences employability
or advancement. Testing is commonly
only a part of a credentialing process, which 
may also include other requirements, such as 
education or supervised experiences. The 
Standards apply to the use of tests in the broad¬ 
er credentialing process. 

Testing is also carried out in work organizations
for a variety of purposes other than
employment decision making and credentialing. 
Testing to detect psychopathology can take 
place, as in the case of an employee exhibiting
behavioral problems at work. Testing as a tool
for personal growth can be part of training 
and development programs, in which instru¬ 
ments measuring personality characteristics, 
interests, values, preferences, and work styles 
are commonly used with the goal of provid¬ 
ing self-insight to employees. Testing can also 
take place in the context of program evaluation, 
as in the case of an experimental study of the 
effectiveness of a training program, where tests 
may be administered as pre- and post-measures. 
The focus of this chapter, though, is on the use 
of testing in employment and credentialing. 
Many issues relevant to such testing are discussed
in other chapters: technical matters in
chapters 1-6, fairness issues in chapters 7-10, 
general issues of test use in chapter 11, and 
individualized assessment of job candidates in 
chapter 12. 

Employment Testing 

The Influence of Context on Test Use

Employment testing involves using test 
information to aid in personnel decision making. 
Both the content and the context of employment
testing vary widely. Content may cover
various domains of knowledge, skills, abilities, 
traits, dispositions, and values. The context in 
which tests are used also varies widely. Some 
contextual features represent choices made by 
the employing organization; others represent 
constraints that must be accommodated by the 
employing organization. Decisions about the 
design, evaluation, and implementation of a 
testing system are specific to the context in 
which the system is to be used. Important con¬ 
textual features include the following: 

Internal vs. external candidate pool. 
In some instances, such as promotional set¬ 
tings, the candidates to be tested are already 
employed by the organization. In others, 
applications are sought from outside the 
organization. In others, a mix of internal and
external candidates is sought. 

Untrained vs. specialized jobs. In some 
instances, untrained individuals are selected 
either because the job does not require spe¬ 
cialized knowledge or skill or because the organ¬ 
ization plans to offer training after the point 
of hire. In other instances, trained or experienced
workers are sought with the expectation
that they can immediately step into a
specialized job. Thus, the same job may require 
very different selection systems depending on 
whether trained or untrained individuals will
be hired or promoted. 

Short-term vs. long-term focus. In some 
instances, the goal of the selection system is to 
predict performance immediately upon or 
shortly after hire. In other instances, the con¬ 
cern is with longer-term performance, as in the 
case of predictions as to whether candidates 
will successfully complete a multiyear overseas 
job assignment. Concerns about changing job 
tasks and job requirements also can lead to a 
focus on characteristics projected to be nec¬ 
essary for performance on the target job in 
the future, even if not a part of the job as 
currently constituted. 

Screen in vs. screen out. In some 
instances, the goal of the selection system is 
to screen in individuals who will perform well 
on one set of behavioral or outcome criteria 
of interest to the organization. In others, the 
goal is to screen out individuals for whom the 
risk of pathological, deviant, or criminal 
behavior on the job is deemed too high. A 
testing system well suited to one objective 
may be completely inappropriate for another. 
That an individual is evaluated as a low risk 
for engaging in pathological behavior does not 
imply a prediction that the individual will 
exhibit high levels of job performance. That a 
test is predictive of one criterion does not sup¬ 
port the inference of linkages to other criteria 
of interest as well. 

Mechanical vs. judgmental decision 
making. In some instances, test information
is used in a mechanical, standardized fashion.
This is the case when scores on a test battery 
are combined by formula and candidates are 
selected in strict top-down rank order, or when 
only candidates above specific cut scores are 
eligible to continue to subsequent stages of a 
selection system. In other instances, information
from a test is judgmentally integrated with
information from other tests and with nontest 
information to form an overall assessment of 
the candidate. 

Ongoing vs. one-time use of a test.
In some instances, a test may be used for an
extended period of time in an organization, 
permitting the accumulation of data and expe¬ 
rience about the test in that context. In other 
instances, concerns about test security are such 
that repeated use is infeasible, and a new test 
is required for each test administration. For
example, a work-sample test for lifeguards, 
requiring retrieving a mannequin from the 
bottom of a pool, is not compromised if candi¬ 
dates possess detailed knowledge of the test in 
advance. In contrast, a written job knowledge 
test may be severely compromised if some can¬ 
didates have access to the test in advance. The 
key question is whether advance knowledge of 
test content changes the constructs measured 
by the test. 

Fixed applicant pool vs. continuous flow.
In some instances, an applicant pool can be 
assembled prior to beginning the selection 
process, as in the case of a policy that all can¬ 
didates applying before a specific date will be 
considered. In other cases, there is a continuous 
flow of applicants about whom employment 
decisions need to be made on an ongoing basis. 
A ranking of candidates is possible in the case 
of the fixed pool; in the case of a continuous 
flow, a decision may need to be made about 
each candidate independent of information 
about other candidates. 

Small vs. large sample size. Large sample 
sizes are sometimes available for jobs with 
many incumbents, in situations in which multiple
similar jobs can be pooled, or in situations
in which organizations with similar jobs
collaborate in selection system development. 
In other situations, sample sizes are small; at the 
extreme is the case of the single-incumbent 
job. Sample size affects the degree to which 
different lines of evidence can be drawn on in 
examining validity for the intended inference 
to be drawn from the test. For example, rely¬ 
ing on the local setting for empirical linkages 
between test and criterion scores is not techni¬ 
cally feasible with small sample sizes. 

Size of applicant pool, relative to the 
number of job openings. The size of an 
applicant pool can constrain the type of testing 
system that is feasible. For desirable jobs, very 
large numbers of candidates may vie for a small 
number of jobs. Under such scenarios, short 
screening tests may be used to reduce the pool 
to a size for which the administration of more 
time-consuming and expensive tests is practi¬ 
cable. Large applicant pools may also pose test 
security concerns, limiting the organization to 
testing methods that permit simultaneous test 
administration to all candidates. 
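
The mechanical use of test information described in the list above (a formula-based composite with top-down selection, or fixed cut scores) can be sketched as follows. The candidate records, test names, weights, and cut scores are hypothetical, and the two functions show only the two decision rules mentioned in the text, not a recommended selection system.

```python
def composite(scores, weights):
    """Weighted sum of test scores; `weights` maps test name to weight."""
    return sum(weight * scores[name] for name, weight in weights.items())


def select_top_down(candidates, weights, n_openings):
    """Rank candidates by composite score and take the top n (needs a fixed pool)."""
    ranked = sorted(
        candidates,
        key=lambda c: composite(c["scores"], weights),
        reverse=True,
    )
    return [c["id"] for c in ranked[:n_openings]]


def pass_cut_scores(candidates, cut_scores):
    """Keep candidates who meet every cut score (works with a continuous flow)."""
    return [
        c["id"]
        for c in candidates
        if all(c["scores"][test] >= cut for test, cut in cut_scores.items())
    ]


# Hypothetical applicant pool with two invented tests.
pool = [
    {"id": "A", "scores": {"verbal": 80, "typing": 50}},
    {"id": "B", "scores": {"verbal": 60, "typing": 90}},
    {"id": "C", "scores": {"verbal": 70, "typing": 70}},
    {"id": "D", "scores": {"verbal": 55, "typing": 60}},
]
print(select_top_down(pool, {"verbal": 0.5, "typing": 0.5}, n_openings=2))
print(pass_cut_scores(pool, {"verbal": 65, "typing": 50}))
```

The two rules can disagree about the same candidates. Top-down ranking compares each candidate with the rest of the pool, whereas cut scores evaluate each candidate independently, which is why only the latter suits a continuous flow of applicants.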

Thus, test use by employers is conditioned 
by contextual features such as those in the fore¬ 
going list. Knowledge of these features plays an 
important part in the professional judgment 
that will influence both the type of testing sys¬ 
tem that will be developed and the strategy that 
will be used to evaluate critically the validity of 
the inference(s) drawn using the testing system. 

The Validation Process in Employment Testing 

The fundamental inference to be drawn 
from test scores in most applications of test¬ 
ing in employment settings is one of predic¬ 
tion: the test user wishes to make an inference 
from test results to some future job behavior 
or job outcome. Even when the validation strat¬ 
egy used does not involve empirical predictor- 
criterion linkages, as in the case of reliance on 
validity evidence based on test content, there 
is an implied criterion. Thus, while different 
strategies of gathering evidence may be used, 
the inference to be supported is that scores on
the test can be used to predict subsequent job
behavior. The validation process in employment 
settings involves the gathering and evaluation 
of evidence relevant to sustaining or challeng¬ 
ing this inference. As detailed below, a variety 
of validation strategies can be used to support 
this inference. 

It thus follows that establishing this pre¬ 
dictive inference requires that attention be 
paid to two domains: that of the test (the 
predictor) and that of the job behavior or outcome
of interest (the criterion). Evaluating the
use of a test for an employment decision can 
be viewed as testing the hypothesis of a link¬ 
age between these domains. Operationally, there 
are many ways of testing this hypothesis. This 
is illustrated by the following diagram: 


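  predictor ------------- 1 ------------- criterion
   measure                                 measure
      |  \                                    |
      |      \                                |
      2          \                            4
      |              5                        |
      |                  \                    |
      |                      \                |
  predictor                      \        criterion
  construct ------------- 3 ------------- construct
   domain                                  domain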


The diagram differentiates between a pre¬ 
dictor construct domain and a predictor meas¬ 
ure and between a criterion construct domain 
and a criterion measure. A predictor construct 
domain is defined by specifying the set of 
behaviors that will be included under a partic¬ 
ular construct label (e.g., verbal reasoning, 
typing speed, conscientiousness). Similarly, a 
criterion construct domain specifies the set of job 
behaviors or job outcomes that will be included 
under a particular construct label (e.g., per¬ 
formance of core job tasks, teamwork, atten¬ 
dance, sales volume, overall job performance). 
Predictor and criterion measures are attempts 
at operationalizing these domains. 

The diagram enumerates a number of
inferences commonly of interest. The first is 
the inference that scores on a predictor measure 
are related to scores on a criterion measure. 
This inference is tested through empirical 
examination of relationships between the two 
measures. The second and fourth are conceptually
similar; both examine the inference that an
operational measure can be interpreted as rep¬ 
resenting an individual's standing on the con¬ 
struct domain of interest. Logical analysis, 
expert judgment, and convergence with or 
divergence from conceptually similar or differ¬ 
ent measures are among the forms of evidence 
that can be examined in testing these linkages. 
The third is the inference of a relationship 
between the predictor construct domain and 
the criterion construct domain. This linkage is 
established on the basis of theoretical and logi¬ 
cal analysis. It commonly draws on systematic 
evaluation of job content and expert judgment 
as to the individual characteristics linked to 
successful job performance. The fifth represents 
the linkage between the predictor measure and 
the criterion construct domain. 

Some predictor measures are designed 
explicitly as samples of the criterion construct 
domain of interest, and, thus, isomorphism 
between the measure and the construct domain 
constitutes direct evidence for linkage 5. 
Establishing linkage 5 in this fashion is the hall¬ 
mark of approaches that rely heavily on what 
these Standards refer to as “validity evidence 
based on test content,” referred to as content 
validity in prior conceptualizations of the valida¬ 
tion process. Tests in which candidates for life¬ 
guard positions perform rescue operations or in 
which candidates for word processor positions 
type and edit text exemplify this approach. 

A prerequisite to the use of a predictor 
measure for personnel selection is that the 
linkage between the predictor measure and
the criterion construct domain be established. 
As the diagram illustrates, there are multiple 
strategies for establishing this crucial linkage. 
One strategy is direct, via linkage 5; a second
involves pairing linkage 1 and linkage 4; and a
third involves pairing linkage 2 and linkage 3. 

When the test is designed as a sample of 
the criterion construct domain, this linkage can 
be established directly via linkage 5. Another 
strategy for linking a predictor measure and the 
criterion construct domain focuses on linkages 
1 and 4: pairing an empirical link between the 
predictor and criterion measures with evidence 
of the adequacy with which the criterion meas¬ 
ure represents the criterion construct domain. 
The empirical link between the predictor meas¬ 
ure and the criterion measure is part of what 
these Standards refer to as “validity evidence 
based on relationships to other variables,” 
referred to as criterion-related validity in prior 
conceptualizations of the validation process. 
The empirical link of the test and the criterion 
measure must be supplemented by evidence of 
the relevance of the criterion measure to the 
criterion construct domain to complete the 
linkage between the test and the criterion construct
domain. Evidence of the relevance of the
criterion measure to the criterion construct 
domain is commonly based on job analysis, 
though in some cases the link between the 
domain and the measure is so direct that rele¬ 
vance is apparent without job analysis (e.g., 
when the criterion construct of interest is 
absenteeism or turnover). Note that this strate¬ 
gy does not necessarily rely on a well-developed 
predictor construct domain. Predictor measures 
such as empirically keyed biodata measures are 
constructed on the basis of empirical links 
between test item responses and the criterion
measure of interest. Such measures may, in 
some instances, be developed without a fully 
established a priori conception of the predictor 
construct domain; the basis for their use is the 
direct empirical link between test responses and 
a relevant criterion measure. 

Yet another strategy for linking predictor 
scores and the criterion construct domain 
focuses on pairing evidence of the adequacy 
with which the predictor measure represents 
the predictor construct domain (linkage 2) 
with evidence of the linkage between the predictor
construct domain and the criterion construct
domain (linkage 3). As noted above,
there is no single direct route to establishing 
these linkages. They involve lines of evidence 
subsumed under “construct validity” in prior
conceptualizations of the validation process. A 
combination of lines of evidence, such as 
expert judgment of the characteristics predic¬ 
tive of job success, inferences drawn from an 
analysis of critical incidents of effective and
ineffective job performance, and interview and 
observation methods, may support inferences 
about the predictor constructs linked to the
criterion construct domain. Measures of these 
predictor constructs may then be selected or 
developed, and the linkage between the predic¬ 
tor measure and the predictor construct domain 
can be established with various lines of evidence 
for linkage 2 discussed above. 

Thus multiple sources of data and multi¬ 
ple lines of evidence can be drawn on to evalu¬ 
ate the linkage between a predictor measure 
and the criterion construct domain of interest. 
There is not a single correct or even a preferred 
method of inquiry for establishing this linkage. 
Rather, the test user must consider the specifics 
of the testing situation and apply professional 
judgment in developing a strategy for testing 
the hypothesis of a linkage between the predic¬ 
tor measure and the criterion domain. 

For many testing applications, there is a 
considerable cumulative body of research that 
speaks to some, if not all, of the inferences dis¬ 
cussed above. A meta-analytic integration of 
this research can form an integral part of the 
strategy for linking test information to the 
construct domain of interest. The value of col¬ 
lecting local validation data varies with the 
magnitude, relevance, and consistency of 
research findings using similar predictor meas¬ 
ures and similar criterion construct domains 
for similar jobs. In some cases, a small and 
inconsistent cumulative research record may 
lead to a validation strategy that relies heavily 
on local data; in others, a large, consistent
research base may make investing resources in
additional local data collection unnecessary. 
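
As a rough illustration of what a meta-analytic integration can involve, the sketch below pools correlations from several studies using sample-size weights. It is a bare-bones summary under stated assumptions: the study values are hypothetical, the function name is invented for this example, and no corrections for statistical artifacts are applied. A full meta-analysis would involve considerably more than this.

```python
from typing import Sequence, Tuple


def weighted_mean_correlation(studies: Sequence[Tuple[float, int]]) -> Tuple[float, float]:
    """Pool (correlation, sample size) pairs using sample-size weights.

    Returns the weighted mean correlation and the weighted variance of the
    observed correlations around that mean. A small variance suggests a
    consistent research record; a large one suggests inconsistency.
    """
    if not studies:
        raise ValueError("at least one study is required")
    total_n = sum(n for _, n in studies)
    mean_r = sum(r * n for r, n in studies) / total_n
    var_r = sum(n * (r - mean_r) ** 2 for r, n in studies) / total_n
    return mean_r, var_r


# Hypothetical prior studies: (observed validity coefficient, sample size).
prior_studies = [(0.20, 100), (0.30, 300), (0.40, 100)]
mean_r, var_r = weighted_mean_correlation(prior_studies)
print(f"weighted mean r = {mean_r:.3f}, variance of r = {var_r:.4f}")
```

The magnitude of the pooled value and the spread around it correspond to the "magnitude" and "consistency" considerations named above; relevance to the local job still has to be judged separately.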

Bases for Evaluating Test Use 

While a primary goal of employment test¬ 
ing is the accurate prediction of subsequent 
job behaviors or job outcomes, it is important 
to recognize that there are limits to the degree 
to which such criteria can be predicted. Perfect 
prediction is an unattainable goal. First, behav¬ 
ior in work settings is also influenced by a wide 
variety of organizational and extra-organiza¬ 
tional factors, including supervisor and peer 
coaching, formal and informal training, changes 
in job design, changes in organizational struc¬ 
tures and systems, and changing family respon¬ 
sibilities, among others. Second, behavior in 
work settings is influenced by a wide variety of 
individual characteristics, including knowledge, 
skills, abilities, personality, and work attitudes, 
among others. Thus any single characteristic 
will be only an imperfect predictor, and even 
complex selection systems focus on the set of 
constructs deemed most critical for the job,
rather than on all characteristics that can influ¬ 
ence job behavior. Third, some measurement 
error always occurs even in well-developed test 
and criterion measures. 

Thus, testing systems cannot be judged 
against a standard of perfect prediction but 
rather in terms of comparisons with available 
alternative selection methods. Professional 
judgment, informed by knowledge of the 
research literature about the degree of predic¬ 
tive accuracy relative to available alternatives, 
influences decisions about test use. 

Decisions about test use are often influ¬ 
enced by additional considerations including 
utility (i.e., cost-benefit) evaluation, value 
judgments about the relative importance of 
selecting for one criterion domain vs. others, 
concerns about applicant reactions to test content
and process, the availability and appropriateness
of alternative selection methods,
statutory or regulatory requirements governing 
test use, and social issues such as workforce 
diversity. Organizational values necessarily
come into play in making decisions about test 
use; organizations with comparable evidence 
supporting an intended inference drawn from 
test scores may thus reach different conclusions 
about whether to use any particular test. 

Testing in Professional and 
Occupational Credentialing 

Tests are widely used in the credentialing of 
persons for many occupations and profes¬ 
sions. Licensing requirements are imposed by 
state and local governments to ensure that 
those licensed possess knowledge and skills in 
sufficient degree to perform important occu¬ 
pational activities safely and effectively. 
Certification plays a similar role in many 
occupations not regulated by governments and 
is often a necessary precursor to advancement 
in many occupations. Certification has also 
become widely used to indicate that a person 
has certain specific skills (e.g., operation of 
specialized auto repair equipment) or knowl¬ 
edge (e.g., estate planning), which may be only 
a part of their occupational duties. Licensure 
and certification, as well as registry and other 
warrants of expertise, will here generically be 
called credentialing. 

Tests used in credentialing are intended 
to provide the public, including employers 
and government agencies, with a dependable 
mechanism for identifying practitioners who 
have met particular standards. The standards 
are strict, but not so stringent as to unduly 
restrain the right of qualified individuals to 
offer their services to the public. Credentialing 
also serves to protect the profession by 
excluding persons who are deemed to be not 
qualified to do the work of the occupation. 
Qualifications for credentials typically include 
educational requirements, some amount of 
supervised experience, and other specific crite¬ 
ria, as well as attainment of a passing score on 
one or more examinations. Tests are used in
credentialing in a broad spectrum of professions
and occupations, including medicine,
law, psychology, teaching, architecture, real 
estate, and cosmetology. In some of these, 
such as actuarial science, clinical neuropsychology,
and medical specialties, tests are also
used to certify advanced levels of expertise. 
Relicensure or recertification is also required 
in some occupations and professions. 

Tests used in credentialing are designed 
to determine whether the essential knowledge 
and skills of a specified domain have been 
mastered by the candidate. The focus of per¬ 
formance standards is on levels of knowledge 
and performance necessary for safe and appro¬ 
priate practice. Test design generally starts with 
an adequate definition of the occupation or 
specialty, so that persons can be clearly identi¬ 
fied as engaging in the activity. Then, the 
nature and requirements of the occupation, in 
its current form, are delineated. Often, a 
thorough analysis is conducted of the work 
performed by people in the profession or 
occupation to document the tasks and abilities 
that are essential to practice. A wide variety of 
empirical approaches is used, including delin¬ 
eation, critical incidence techniques, job analy¬ 
sis, training needs assessments, or practice 
studies and surveys of practicing professionals. 
Panels of respected experts in the field often 
work in collaboration with qualified specialists 
in testing to define test specifications, includ¬ 
ing the knowledge and skills needed for safe, 
effective performance, and an appropriate way 
of assessing that performance. Forms of testing 
may include traditional multiple-choice tests, 
written essays, and oral examinations. More 
elaborate performance tasks, sometimes using 
computer-based simulation, are also used in 
assessing such practice components as, for 
example, patient diagnosis or treatment plan¬ 
ning. Hands-on performance tasks may also 
be used (e.g., operating a boom crane or fill¬ 
ing a tooth) while being observed by one or 
more examiners. 

Credentialing tests may cover a number of 
related but distinct areas. Designing the testing 
program includes deciding what areas are to be
covered, whether one or a series of tests is to 
be used, and how multiple test scores are to be 
combined to reach an overall decision. In some 
cases high scores on some tests are permitted 
to offset low scores on other tests, so that addi¬ 
tive combination is appropriate. In other cases, 
an acceptable performance level is required on 
each test in an examination series. 
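
The two combination rules just described can be sketched as follows. This is an illustration only: the function names, weights, and cut scores are hypothetical and are not drawn from any particular credentialing program.

```python
from typing import Sequence


def compensatory_pass(scores: Sequence[float],
                      weights: Sequence[float],
                      composite_cut: float) -> bool:
    """Additive rule: high scores on some tests may offset low scores on others."""
    if len(scores) != len(weights):
        raise ValueError("scores and weights must have the same length")
    composite = sum(w * s for w, s in zip(weights, scores))
    return composite >= composite_cut


def conjunctive_pass(scores: Sequence[float], cuts: Sequence[float]) -> bool:
    """Multiple-hurdle rule: every test must reach its own cut score."""
    if len(scores) != len(cuts):
        raise ValueError("scores and cuts must have the same length")
    return all(s >= c for s, c in zip(scores, cuts))


# Hypothetical candidate: strong on the first test, weak on the second.
candidate = [90.0, 60.0]
print(compensatory_pass(candidate, weights=[0.5, 0.5], composite_cut=70.0))  # True
print(conjunctive_pass(candidate, cuts=[70.0, 70.0]))                        # False
```

The same candidate passes under the additive rule and fails under the each-test rule, which is why the choice of rule is part of designing the testing program.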

Validation of credentialing tests depends
mainly on content-related evidence, often in 
the form of judgments that the test adequately 
represents the content domain of the occupa¬ 
tion or specialty being considered. Such evi¬ 
dence may be supplemented with other forms 
of evidence external to the test. Criterion-relat¬ 
ed evidence is of limited applicability in licen¬ 
sure settings because criterion measures are 
generally not available for those who are not 
granted a license. 

Defining the minimum level of knowl¬ 
edge and skill required for licensure or certifi¬ 
cation is one of the most important and 
difficult tasks facing those responsible for credentialing.
Verifying the appropriateness of
the cut score or scores on the tests is a critical 
element in validity. The validity of the infer¬ 
ence drawn from the test depends on whether 
the standard for passing makes a valid distinc¬ 
tion between adequate and inadequate per¬ 
formance. Often, panels of experts are used to 
specify the level of performance that should be 
required. Standards must be high enough to 
protect the public, as well as the practitioner, 
but not so high as to be unreasonably limiting.

Legislative bodies sometimes attempt to 
legislate a cut score, such as a score of 70%. 
Arbitrary numerical specifications of cut scores 
are unhelpful for two reasons. First, without 
detailed information about the test, job
requirements, and their relationship, sound 
standard setting is impossible. Second, without
detailed information about the format of the
test and the difficulty of items, such numerical 
specifications have little meaning. 

Tests for credentialing need to be precise
in the vicinity of the passing, or cut, score. 
They may not need to be precise for those 
who clearly pass or clearly fail. Sometimes a 
test used in credentialing is designed to be precise
only in the vicinity of the cut score.
Computer-based mastery tests may include a 
procedure to end the testing when a decision 
about the candidate's performance can be
clearly made or when a maximum time limit 
is reached. This may result in a shorter test for 
candidates whose performance clearly exceeds 
or falls far below the minimum performance 
required for a passing score. The test taker 
may be told only whether the decision was 
pass or fail. Because such mastery tests are not 
designed to indicate how badly the candidate 
failed, or how well the candidate passed, provid¬ 
ing scores that are much higher or lower than 
the cut score could be misleading. Nevertheless, 
candidates who fail are likely to profit from 
information about the areas in which their per¬ 
formance was especially weak. When feedback 
to candidates about how well or how poorly 
they performed is intended, precision through¬ 
out the score range is needed. 

Practice in professions and occupations 
often changes over time. Evolving legal restric¬ 
tions, progress in scientific fields, and refine¬ 
ments in techniques can result in a need for 
changes in test content. When change is sub¬ 
stantial, it becomes necessary to revise the defi¬ 
nition of the job, and the test content, to 
reflect changing circumstances. When major 
revisions are made in the test, the cut score 
that identifies required test performance is 
also reestablished. 

Because credentialing is an ongoing
process, with tests given on a regular sched¬ 
ule, new versions of the test are often needed. 
From a technical perspective, all versions of a 
test should be prepared to the same specifi¬ 
cations and represent the same content. 
Alternate test forms should have comparable
score scales so that scores can retain their 
meaning. Various methods of jointly calibrat¬ 
ing alternate forms can be used to assure that 
the standard for passing represents the same 
level of performance on all forms. It may be 
noted that release of past test forms may com¬ 
promise the quality of test form comparability. 

Some credentialing groups consider it 
necessary, as a practical matter, to adjust their 
criteria yearly in order to regulate the number 
of accredited candidates entering the profes¬ 
sion. This questionable procedure raises seri¬ 
ous problems for the technical quality of the 
test scores. Adjusting the cut score annually 
implies higher standards in some years than in 
others, which, although open and straight¬ 
forward, is difficult to justify on the grounds 
of quality of performance. Adjusting the score 
scale so that a certain number or proportion 
reach the passing score, while less obvious to the 
candidates, is technically inappropriate because 
it changes the meaning of the scores from 
year to year. Passing a credentialing examination
should signify that the candidate meets
the knowledge and skill standards set by the 
credentialing body, independent of the avail¬ 
ability of work. 

Issues of cheating and test security are of 
special importance for testing practices in cre¬ 
dentialing. Issues of test security are covered 
in chapters 5 and 11. Issues of cheating by 
test takers are covered in chapter 8. Issues con¬ 
cerning the technical quality of tests are found 
in chapters 1-6, and issues of fairness in chap¬ 
ters 7-10. 


Standard 14.1 

Prior to development and implementation 
of an employment test, a clear statement 
of the objective of testing should be made. 
The subsequent validation effort should be 
designed to determine how well the objec¬ 
tive has been achieved. 

Comment: The objectives of employment 
tests can vary considerably. Some aim to 
screen out those least suited for the job in 
question, while others are designed to iden¬ 
tify those best suited for the job. Tests also 
vary in the aspects of job behavior they are 
intended to predict, which may include 
quantity or quality of work output, tenure, 
counterproductive behavior, and teamwork, 
among others. 

Standard 14.2 

When a test is used to predict a criterion, 
the decision to conduct local empirical 
studies of predictor-criterion relationships 
and interpretation of the results of local 
studies of predictor-criterion relationships 
should be grounded in knowledge of rele¬ 
vant research. 

Comment: The cumulative literature on the 
relationship between a particular type of 
predictor and type of criterion may be sufficiently
large and consistent to support the
predictor-criterion relationship without additional
research. In some settings, the cumulative
research literature may be so substantial
and so consistent that a dissimilar finding in 
a local study should be viewed with caution 
unless the local study is exceptionally sound. 
Local studies are of greatest value in settings 
where the cumulative research literature is 
sparse (e.g., due to the novelty of the predic¬ 
tor and/or criterion used), where the cumula¬ 
tive record is inconsistent, or where the 
cumulative literature does not include studies 
similar to the local setting (e.g., a test with a 
large cumulative literature dealing exclusively
with production jobs, and a local setting 
involving managerial jobs). 

Standard 14.3 

Reliance on local evidence of empirically 
determined predictor-criterion relationships 
as a validation strategy is contingent on a 
determination of technical feasibility. 

Comment: Meaningful evidence of predictor- 
criterion relationships is conditional on a 
number of features, including (a) the job 
being relatively stable, rather than in a period 
of rapid evolution; (b) the availability of a rel¬ 
evant and reliable criterion measure; (c) the 
availability of a sample reasonably represen¬ 
tative of the population of interest; and (d) 
an adequate sample size for estimating the 
strength of the predictor-criterion relationship. 
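
To make feature (d) concrete, the sketch below shows how the precision of an estimated predictor-criterion correlation depends on sample size, using the Fisher z approximation for a confidence interval. The observed correlation and the sample sizes are hypothetical, and the approximation assumes bivariate normality; it is offered as one reasonable way to gauge adequacy, not as a prescribed procedure.

```python
import math
from typing import Tuple


def correlation_confidence_interval(r: float, n: int, z_crit: float = 1.96) -> Tuple[float, float]:
    """Approximate confidence interval for a correlation via the Fisher z transform.

    z_crit = 1.96 gives a nominal 95% interval.
    """
    if n <= 3:
        raise ValueError("sample size must exceed 3")
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    z = math.atanh(r)                  # Fisher z transform of r
    se = 1.0 / math.sqrt(n - 3)        # approximate standard error of z
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Hypothetical observed validity coefficient of .30 at two sample sizes.
for n in (50, 500):
    low, high = correlation_confidence_interval(0.30, n)
    print(f"n = {n}: interval from {low:.2f} to {high:.2f}")
```

With the smaller sample the interval is wide enough that the study says little about the strength of the relationship, which is the sense in which sample size bears on technical feasibility.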

Standard 14.4 

When empirical evidence of predictor-crite¬ 
rion relationships is part of the pattern of 
evidence used to support test use, the criteri¬ 
on measure(s) used should reflect the criteri¬ 
on construct domain of interest to the 
organization. All criteria used should repre¬ 
sent important work behaviors or work out¬ 
puts, on the job or in job-relevant training, 
as indicated by an appropriate review of 
information about the job. 

Comment: When criteria are constructed to 
represent job activities or behaviors (e.g., 
supervisory ratings of subordinates on impor¬ 
tant job dimensions), systematic collection of 
information about the job informs the devel¬ 
opment of the criterion measures, though 
there is no clear choice among the many 
available job analysis methods. There is not
a clear need for job analysis to support criteri¬ 
on use when measures such as absenteeism or 
turnover are the criteria of interest. 


Standard 14.5 

Individuals conducting and interpreting 
empirical studies of predictor-criterion rela¬ 
tionships should identify contaminants and 
artifacts that may have influenced study 
findings, such as error of measurement, 
range restriction, and the effects of missing 
data. Evidence of the presence or absence 
of such features, and of actions taken to 
remove or control their influence, should be 
retained and made available as needed. 

Comment: Error of measurement in the criteri¬ 
on and restriction in the variability of predic¬ 
tor or criterion scores systematically reduce 
estimates of the relationship between predic¬ 
tor measures and the criterion construct 
domain, and procedures for correction for the 
effects of these artifacts are available. When 
these procedures are applied, both corrected 
and uncorrected values should be presented, 
along with the rationale for the correction pro¬ 
cedures chosen. Statistical significance tests for 
uncorrected correlations should not be used 
with corrected correlations. Other features to 
be considered include issues such as missing 
data for some variables for some individuals, 
decisions about the retention or removal of 
extreme data points, the effects of capitaliza¬ 
tion on chance in selecting predictors from a 
larger set on the basis of strength of predictor- 
criterion relationships, and the possibility of 
spurious predictor-criterion relationships, as
in the case of collecting criterion ratings from 
supervisors who know selection test scores. 
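
Two widely taught correction formulas illustrate the kind of procedure referred to above: the classical correction for unreliability in the criterion, and the correction for direct range restriction on the predictor (often labeled Thorndike Case II). The Standards do not prescribe particular formulas, so these are assumptions chosen for illustration, and all numeric inputs are hypothetical.

```python
import math


def correct_for_criterion_unreliability(r_xy: float, r_yy: float) -> float:
    """Correct an observed correlation for measurement error in the criterion only.

    r_yy is the reliability of the criterion measure.
    """
    if not 0.0 < r_yy <= 1.0:
        raise ValueError("criterion reliability must be in (0, 1]")
    return r_xy / math.sqrt(r_yy)


def correct_for_direct_range_restriction(r_xy: float,
                                         sd_unrestricted: float,
                                         sd_restricted: float) -> float:
    """Correct for direct selection on the predictor (Thorndike Case II)."""
    if sd_unrestricted <= 0 or sd_restricted <= 0:
        raise ValueError("standard deviations must be positive")
    u = sd_unrestricted / sd_restricted
    return (r_xy * u) / math.sqrt(1.0 + r_xy ** 2 * (u ** 2 - 1.0))


# Hypothetical local study: report uncorrected and corrected values together.
observed_r = 0.30
print("uncorrected:", observed_r)
print("corrected for criterion unreliability:",
      round(correct_for_criterion_unreliability(observed_r, r_yy=0.64), 3))
print("corrected for range restriction:",
      round(correct_for_direct_range_restriction(observed_r, 10.0, 5.0), 3))
```

Printing the uncorrected value alongside each corrected value follows the recommendation in the comment; the rationale for choosing a given correction would still need to be documented separately.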

Standard 14.6 

Evidence of predictor-criterion relationships in 
a current local situation should not be inferred 
from a single previous validation study unless 
the previous study of the predictor-criterion 
relationship was done under favorable condi¬ 
tions (i.e., with a large sample size and a rele¬ 
vant criterion) and if the current situation 
corresponds closely to the previous situation. 

Comment: Close correspondence means that
the job requirements or underlying psycho¬ 
logical constructs are substantially the same 
(as is determined by a job analysis), and that 
the predictor is substantially the same. 

Standard 14.7 

If tests are to be used to make job classifica¬ 
tion decisions (e.g., the pattern of predictor 
scores will be used to make differential job 
assignments), evidence that scores are linked 
to different levels or likelihoods of success 
among jobs or job groups is needed. 

Standard 14.8 

Evidence of validity based on test content 
requires a thorough and explicit definition 
of the content domain of interest. For selec¬ 
tion, classification, and promotion, the char¬ 
acterization of the domain should be based 
on job analysis. 

Comment: In general, the job content 
domain should be described in terms of job 
tasks or worker knowledge, skills, abilities, 
and other personal characteristics that are
clearly operationally defined so that they can 
be linked to test content, and for which job 
demands are not expected to change substan¬ 
tially over a specified period of time. 
Knowledge, skills, and abilities included 
in the content domain should be those the 
applicant should already possess when being 
considered for the job in question. 

Standard 14.9 

When evidence of validity based on test con¬ 
tent is a primary source of validity evidence 
in support of the use of a test in selection or 
promotion, a close link between test content 
and job content should be demonstrated. 

Comment: For example, if the test content
samples job tasks with considerable fidelity
(e.g., actual job samples such as machine
operation) or, in the judgment of experts, 
correctly simulates job task content (e.g., cer¬ 
tain assessment center exercises), or samples 
specific job knowledge required for successful 
job performance (e.g., information necessary 
to exhibit certain skills), then content-related 
evidence can be offered as the principal form 
of evidence of validity. If the link between the 
test content and the job content is not clear 
and direct, other lines of validity evidence 
take on greater importance. 

Standard 14.10 

When evidence of validity based on test con¬ 
tent is presented, the rationale for defining 
and describing a specific job content domain 
in a particular way (e.g., in terms of tasks to 
be performed or knowledge, skills, abilities, 
or other personal characteristics) should be 
stated clearly. 

Comment: When evidence of validity based 
on test content is presented for a job or class
of jobs, the evidence should include a 
description of the major job characteristics 
that a test is meant to sample, including 
the relative frequency, importance, or criti¬ 
cality of the elements. 

Standard 14.11 

If evidence based on test content is a pri¬ 
mary source of validity evidence supporting 
the use of a test for selection into a particu¬ 
lar job, a similar inference should be made 
about the test in a new situation only if the 
critical job content factors are substantially 
the same (as is determined by a job analy¬ 
sis), the reading level of the test material 
does not exceed that appropriate for the 
new job, and there are no discernible fea¬ 
tures of the new situation that would sub¬ 
stantially change the original meaning of 
the test material. 

Standard 14.12

When the use of a given test for personnel 
selection relies on relationships between a 
predictor construct domain that the test rep¬ 
resents and a criterion construct domain, 
two links need to be established. First, there 
should be evidence for the relationship 
between the test and the predictor construct 
domain, and second, there should be evi¬ 
dence for the relationship between the pre¬ 
dictor construct domain and major factors 
of the criterion construct domain. 

Comment: There should be a clear conceptual 
rationale for these linkages. Both the predic¬ 
tor construct domain and the criterion con¬ 
struct domain to which it is to be linked 
should be defined carefully. There is no sin¬ 
gle route to establishing these linkages. 
Evidence in support of linkages between the 
two construct domains can include patterns 
of findings in the research literature and sys¬ 
tematic evaluation of job content to identify 
predictor constructs linked to the criterion
domain. The bases for judgments linking the 
predictor and criterion construct domains 
should be articulated. 

Standard 14.13 

When decision makers integrate informa¬ 
tion from multiple tests or integrate test 
and nontest information, the role played by 
each test in the decision process should be 
clearly explicated, and the use of each test 
or test composite should be supported by 
validity evidence. 

Comment: A decision maker may integrate 
test scores with interview data, reference 
checks, and many other sources of informa¬ 
tion in making employment decisions. The 
inferences drawn from test scores should be
limited to those for which validity evidence 
is available. For example, viewing a high test 
score as indicating overall job suitability, and
thus precluding the need for reference checks,
would be an inappropriate inference from a 
test measuring a single narrow, albeit relevant, 
domain, such as job knowledge. In other cir¬ 
cumstances, decision makers integrate scores 
across multiple tests, or across multiple scales 
within a given test. 

Standard 14.14 

The content domain to be covered by a credentialing
test should be defined clearly and
justified in terms of the importance of the 
content for credential-worthy performance 
in an occupation or profession. A rationale 
should be provided to support a claim that 
the knowledge or skills being assessed are 
required for credential-worthy performance 
in an occupation and are consistent with the 
purpose for which the licensing or certifica¬ 
tion program was instituted. 

Comment: Some form of job or practice 
analysis provides the primary basis for defin¬ 
ing the content domain. If the same examina¬ 
tion is used in the licensure or certification of 
people employed in a variety of settings and 
specialties, a number of different job settings 
may need to be analyzed. Although the job 
analysis techniques may be similar to those 
used in employment testing, the emphasis for 
licensure is limited appropriately to knowl¬ 
edge and skills necessary for effective practice. 
The knowledge and skills contained in a core 
curriculum designed to train people for the 
job or occupation may be relevant, especially 
if the curriculum has been designed to be 
consistent with empirical job or practice 
analyses. In tests used for licensure, skills 
that may be important to success but are not 
directly related to the purpose of licensure 
(e.g., protecting the public) should not be 
included. For example, in real estate, market¬ 
ing skills may be important for success as a 
broker, and assessment of these skills might 
have utility for agencies selecting brokers for 
employment. However, lack of these skills
may not present a threat to the public and 
would appropriately be excluded from con¬ 
sideration for a licensing examination. The 
fact that successful practitioners possess cer¬ 
tain knowledge or skills is relevant but not 
persuasive. Such information needs to be 
coupled with an analysis of the purpose of 
a licensing program and the reasons that 
the knowledge or skill is required in an 
occupation or profession. 

Standard 14.15 

Estimates of the reliability of test-based cre- 
dentialing decisions should be provided. 

Comment: The standards for decision reliabili¬ 
ty described in chapter 2 are applicable to 
tests used for licensure and certification. 
Other types of reliability estimates and asso¬ 
ciated standard errors of measurement may 
also be useful, but the reliability of the deci¬ 
sion of whether or not to certify is of pri¬ 
mary importance. 
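
As one illustration of decision reliability, the sketch below estimates how consistently pass-fail decisions agree when the same candidates take two forms of a test. The function name, the two-form design, and the use of Cohen's kappa as the chance-corrected index are assumptions made for illustration; chapter 2 governs which estimate is appropriate for a given program.

```python
def decision_consistency(scores_a, scores_b, cut_score):
    """Return (agreement, kappa) for pass-fail decisions on two forms.

    agreement: proportion of candidates classified the same way on both forms.
    kappa: agreement corrected for chance (Cohen's kappa).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_a)
    # A candidate passes a form when the score meets or exceeds the cut score.
    pass_a = [s >= cut_score for s in scores_a]
    pass_b = [s >= cut_score for s in scores_b]
    agreement = sum(a == b for a, b in zip(pass_a, pass_b)) / n
    # Chance agreement from the marginal pass rates on each form.
    p_a = sum(pass_a) / n
    p_b = sum(pass_b) / n
    chance = p_a * p_b + (1 - p_a) * (1 - p_b)
    kappa = (agreement - chance) / (1 - chance) if chance < 1 else 1.0
    return agreement, kappa
```

A low kappa alongside a high raw agreement usually means that most candidates fall far from the cut score, so the agreement says little about candidates near it.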

Standard 14.16 

Rules and procedures used to combine 
scores on multiple assessments to determine 
the overall outcome of a credentialing test 
should be reported to test takers, preferably 
before the test is administered. 

Comment: In some cases, candidates may be 
required to score above a specified minimum 
on each of several tests. In other cases, the 
pass-fail decision may be based solely on a 
total composite score. While candidates may 
be told that tests will be combined into a 
composite, the specific weights given to 
various components may not be known in 
advance (e.g., to achieve equal effective 
weights, nominal weights will depend on 
the variance of the components). 
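
The dependence of nominal weights on component variances can be sketched as follows. This is a minimal illustration, not a prescribed procedure: it assumes that a component's effective weight is its share of the composite variance (its weighted covariance with the composite), and all function names are hypothetical.

```python
from statistics import mean, pstdev


def covariance(x, y):
    """Population covariance of two equal-length score lists."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)


def effective_weights(components, nominal_weights):
    """Share of composite variance attributable to each component.

    Component i contributes w_i * sum_j w_j * cov(X_i, X_j);
    the returned shares sum to 1.
    """
    contributions = [
        w_i * sum(w_j * covariance(x_i, x_j)
                  for w_j, x_j in zip(nominal_weights, components))
        for w_i, x_i in zip(nominal_weights, components)
    ]
    total = sum(contributions)
    return [c / total for c in contributions]


def equalizing_nominal_weights(components):
    """Nominal weights inversely proportional to each component's SD."""
    return [1 / pstdev(x) for x in components]
```

With two components, weights inversely proportional to the standard deviations give exactly equal effective weights under this definition. With three or more components they do so only when the intercorrelations are balanced, which is one reason the final weights may not be known until score data are available.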


Standard 14.17 

The level of performance required for pass¬ 
ing a credentialing test should depend on 
the knowledge and skills necessary for 
acceptable performance in the occupation 
or profession and should not be adjusted 
to regulate the number or proportion of 
persons passing the test. 

Comment: The number or proportion of 
persons granted credentials should be adjust¬ 
ed, if necessary, on some basis other than 
modifications to either the passing score or 
the passing level. The cut score should be 
determined by a careful analysis and judg¬ 
ment of acceptable performance. When 
there are alternate forms of the test, the cut 
score should be carefully equated so that it 
has the same meaning for all forms. 
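
To illustrate what carrying a cut score across alternate forms involves, the sketch below applies linear equating, which maps a score on one form to the score on another form that lies the same number of standard deviations from the mean. It assumes the two forms were taken by randomly equivalent groups; the function name and data layout are hypothetical, and operational programs often rely on anchor tests or other equating designs instead.

```python
from statistics import mean, pstdev


def equate_cut_score(cut_on_x, scores_x, scores_y):
    """Map a cut score from form X to form Y by linear equating.

    Assumes scores_x and scores_y come from randomly equivalent groups.
    """
    # Express the cut score as a standardized distance from the form X mean.
    z = (cut_on_x - mean(scores_x)) / pstdev(scores_x)
    # Locate the same standardized position on the form Y scale.
    return mean(scores_y) + z * pstdev(scores_y)
```

If form Y is harder, so that its mean is lower, the equated cut score is correspondingly lower, and the standard required of candidates stays the same.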
Background 

Tests are widely used in program evaluation 
and in public policy decision making. Program 
evaluation is the set of procedures used to make 
judgments about the client’s need for a program, 
the way it is implemented, its effectiveness, 
and its value. Policy studies are somewhat 
broader than program evaluations and refer to 
studies that contribute to judgments about 
plans, principles, or procedures enacted to 
achieve broad public goals. There is no sharp 
distinction between policy studies and program 
evaluations, and in many instances there is 
substantial overlap between the two types of 
investigations. Test results are often one impor¬ 
tant source of evidence for the initiation, 
continuation, modification, termination, or 
expansion of various programs and policies. 

Interpretation of test scores in program 
evaluation and policy studies usually entails the 
complex analysis of a number of variables. For 
example, some programs are mandated for a 
broad population; others target only certain 
subgroups. Some are designed to affect atti¬ 
tudes, while others are intended to have a 
more direct impact on behavior. It is important 
that the participants included in any study at 
least meet the specified criteria for the program 
or policy under review so that appropriate 
interpretation of test results will be possible. 
Test results will reflect not only the effects of 
rules for participant selection and the impact 
of participation in different programs or treat¬ 
ments, but also the characteristics of those test¬ 
ed. Relevant background information about 
clients or students may be obtained in order to 
strengthen the inferences derived from the test 
results. Valid interpretations may depend upon 
additional considerations that have nothing 
to do with the appropriateness of the test or 
its technical quality, including study design, 
administrative feasibility, and the quality of 
other available data. It is not the intent of this 
chapter to deal with these varied considerations 
in any substantial way. In order to develop 
defensible conclusions, however, investigators 
conducting program evaluations and policy 
studies are encouraged to supplement test 
results with data from other sources. These 
include information about program charac¬ 
teristics, delivery, costs, client backgrounds, 
degree of participation, and evidence of side 
effects. Because test results lend important 
weight to evaluation and policy studies, it is 
critical that any tests used in these investiga¬ 
tions be sensitive to the questions of the study 
and appropriate for the test takers. 

It is important to evaluate any proposed 
test in terms of its relevance to the goals of the 
program or policy and/or to the particular 
question its use will address. It is relatively rare 
for a test to be designed specifically for pro¬ 
gram evaluation or policy study purposes. 
Typically, the instruments used in such studies 
were originally developed for purposes other 
than program or policy evaluation. In addi¬ 
tion, because of cost or convenience, certain 
tests may be adopted for use in a program 
evaluation or policy study even though they 
may have been developed for a somewhat dif¬ 
ferent population of respondents. Some tests 
may be selected for use in program evaluation 
or policy studies because the tests are well 
known and thought to be especially credible 
to the clients or the public consumer. Even 
though certain tests may be more familiar to 
the public or may be less time-consuming or 
less expensive to use than an instrument devel¬ 
oped specifically for the evaluation, they may 
be nonetheless inappropriate for use as criteri¬ 
on measures to determine the need for or to 
evaluate the effects of particular interventions. 

As government agencies and other institu¬ 
tions move to improve their own routine data 
collection capability, fewer special studies are 
conducted to evaluate programs and policies. 
Instead, evaluations and policy studies may 
depend upon a special analysis of data previous¬ 
ly collected for other purposes. In these cases, 
the investigators may reanalyze test data already 
obtained and analyzed for another purpose in 
order to make inferences about program or 
policy effectiveness. This procedure is called 
secondary data analysis. In some circumstances, 
it may be difficult to assure a good match 
between the existing test and the intervention 
or the policy under examination. Moreover, it 
may be difficult to reconstruct in detail the 
conditions under which the data were originally 
collected. Secondary data analysis also requires 
consideration of whether adequate informed 
consent was obtained from subjects in the 
original data collection to allow secondary 
analysis to occur without obtaining additional 
consent. In selecting (or developing) a test or 
in deciding to use existing data in evaluation 
and policy studies, careful investigators attempt 
to balance the purpose of the test, its likeli¬ 
hood to be sensitive to the intervention under 
study, the credibility of the test to interested 
parties, and the costs of its administration. 
Otherwise, test results may lead to inappropri¬ 
ate interpretations about the progress, impact, 
and overall value of programs and policies 
under review. 

Program Evaluation 

Tests may be used in program evaluations to 
provide information on the status of clients or 
students before, during, or following an inter¬ 
vention, as well as to provide information on 
appropriate comparison groups. Whereas 
understanding the performance of an individ¬ 
ual student or client is often the goal of many 
testing activities, program evaluation targets 
the performance of, or impact on, groups. 
Tests are used in program evaluations in a vari¬ 
ety of fields, such as social services, education, 
health services, and military and employment 
training. The term program, broadly interpreted, 
describes interventions that range from 
large-scale state or national programs with pro¬ 
visions for local flexibility to small-scale, more 
experimental projects. In many cases, evaluation 
is mandated by the agency or funding source 
for the program, and the intervention is evalu¬ 
ated by judging its effectiveness in meeting 
stated goals. Some examples of programs that 
might use test results as part of their evaluation 
data include psychotherapeutic services, military 
training programs and job placement programs, 
school curricula, or services for individuals with 
special needs. 

Test results, along with other information, 
may be used to compare competing interven¬ 
tions, such as alternative reading curricula or 
different psychotherapeutic interventions, or to 
describe the long-term pattern of effects for 
one or more groups. It is often important to 
assess a program for its differential effectiveness 
in meeting the needs of subgroups (such as dif¬ 
ferent ethnic or gender groups within the tar¬ 
get population). Even though the performance 
of groups is of primary interest in program 
evaluation, the analysis of individuals’ histories 
and test performances may provide additional 
useful information to aid in the interpretation 
of test results. 

Because of administrative realities, such as 
cost constraints and response burden, method¬ 
ological refinements may be adopted to 
increase the efficiency of testing. One strategy 
is to obtain a sample of participants to be eval¬ 
uated from the larger set of those exposed to a 
program or policy. When there is a sufficient 
number of clients affected by the program or 
policy to be evaluated, and when there is a 
desire to limit the time spent on testing, evalu¬ 
ators can create multiple forms of shorter tests 
from a larger pool of items. By constructing a 
number of different test forms consisting of 
relatively few items and assigning these test 
forms to different subsamples of test takers (a 
procedure known as matrix sampling), a larger 
number of items can be included in the study 
than could reasonably be administered to any 
single test taker. When it is desirable to 
represent a domain with a large number of test 
items, this approach is often used. However, 
individual scores are not usually created or 
interpreted when matrix sampling is employed. 
Because procedures for sampling individuals or 
test items may vary in a number of ways, adequate 
analysis and interpretation of test results 
for any study depend upon a clear description 
of how samples were formed and the manner 
in which test results were aggregated. 
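
The matrix sampling procedure described above can be sketched as follows. This is a minimal illustration under simple assumptions: items are dealt at random into short forms of nearly equal length, and forms are rotated across randomly ordered test takers. The function name, parameters, and fixed seed are hypothetical, and an operational design would usually also balance content coverage across forms.

```python
import random


def matrix_sample(item_ids, test_taker_ids, n_forms, seed=0):
    """Split items into n_forms short forms and assign one form per test taker.

    Returns (forms, assignment): forms is a list of item-id lists, and
    assignment maps each test taker to the index of the form administered.
    """
    rng = random.Random(seed)
    items = list(item_ids)
    rng.shuffle(items)
    # Deal shuffled items round-robin so form lengths differ by at most one.
    forms = [items[i::n_forms] for i in range(n_forms)]
    takers = list(test_taker_ids)
    rng.shuffle(takers)
    # Rotate forms across the shuffled test takers to balance subsample sizes.
    assignment = {taker: i % n_forms for i, taker in enumerate(takers)}
    return forms, assignment
```

Recording the seed, the form contents, and the assignment gives the description of how samples were formed that later analysis depends on. Results are then aggregated by item or by form across test takers, not reported as individual scores.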

Policy Uses of Tests 

As noted previously, tests are also used in poli¬ 
cy analyses, and the distinction between pro¬ 
gram evaluation and policy uses of tests is 
often a matter of degree. Programs are expect¬ 
ed to share particular goals, procedures, and 
resources. Policy is a broader term, applying 
to plans, principles, procedures, or programs 
enacted to achieve particular goals in different 
settings. Programs provide direct services or 
interventions. Policies may be constructed to 
achieve their goals by direct or indirect means. 
Indeed, one direct approach used to achieve a 
policy goal might include the funding of spe¬ 
cific programs. Other examples of direct policy 
approaches might involve the provision of 
training resources to improve performance in 
particular health-service occupations, or the 
enactment of new recertification requirements 
for accountants. Studies of the need for or 
impact of both of these policies could in part 
depend upon the analyses of test results. To 
illustrate in more depth, to meet the general 
policy objective of containing the costs of 
health care, direct policies might include giv¬ 
ing incentives to clients to participate in fitness 
programs and the development of patient 
education programs. Tests could measure the 
understandings and attitudes of participants 
about the relationship of fitness to the preven¬ 
tion of illness. Another policy example, using 
a more indirect approach, is to encourage educators 
to create more effective programs for 
children from low-income families. As an 
approach, a state’s educational authorities 
might require the separate reporting of test 
scores for children in high-poverty areas. 
Large differences in group performance would 
be expected to attract the attention of the public 
and to place greater pressure on the schools 
to improve the performance of particular 
groups of children. 

In decentralized governments, policy 
implementation may be left to local authorities 
and may be interpreted in a number of differ¬ 
ent ways. As a result, it may be difficult to 
select or develop a single test or outcome 
measure that will be sensitive to the range of 
different activities or tactics used to implement 
a given policy. For that reason, policy studies 
may often use more than one test or outcome 
measure to provide a more adequate picture 
of the range of effects. 

Issues in Program and Policy 
Evaluation 

Test results are sometimes used as one way to 
inspire program administrators as well as to 
infer institutional effectiveness. This use of 
tests, including the public reporting of results, 
is thought to encourage an institution to 
improve its services for its clients. For example, 
consistently poor achievement test results may 
trigger special management attention for pub¬ 
lic schools in some locales. The interpretation 
of test results is especially complex when tests 
are used both as an institutional policy mecha¬ 
nism and as a measure of effectiveness. For 
example, a policy or program may be based on 
the assumption that providing clear goals and 
general specifications of test content (such as 
the type of topics, constructs and cognitive 
domains, and responses included in the test) 
may be a reasonable strategy to communicate 
new expectations to educators. Yet, the desire 
to influence test or evaluation results to show 
acceptable institutional performance could lead 
to inappropriate testing practices, such as 
teaching the test items in advance, modifying 
test administration procedures, discouraging 
certain students or clients from participating 
in the testing sessions, or focusing exclusively 
on test-taking procedures. These practices 
might occur instead of those aimed at helping 
the test taker learn the domains measured by 
the test. Because results derived from such 
practices might lead to spuriously high esti¬ 
mates of impact and might reflect the negative 
side effects of this particular policy, diligent 
investigators may estimate the impact of such 
consequences in order to interpret the test 
results appropriately. Looking at possible inap¬ 
propriate consequences of tests as well as their 
benefits will better assess policy claims that 
particular types of testing programs lead to 
improved performance. 

On the other hand, policy studies and 
program evaluations often do not make avail¬ 
able reports of results to the test takers and 
may give no clear reasons to the test taker for 
participating in the testing procedure. For 
example, when matrix sampling is used for 
program evaluation, it may not be feasible to 
provide such reports. If little effort is made to 
motivate the test taker to regard the test seri¬ 
ously (for instance, if the purpose of the test is 
not explained to the test taker), it is possible 
that test takers might have little reason to try 
to perform well on the test. Obtained test 
results then might well underrepresent the 
impact of the program, institution, or policy 
because of poor motivation on the part of the 
test taker. When there is a suspicion that the 
test might not have been taken seriously, moti¬ 
vation of test takers may be explored by 
collecting additional information, using 
observation or interview methods. The issues 
of inappropriate preparation or unmotivated 
performance are examples that raise basic ques¬ 
tions about the validity of interpretations of 
test results. In every case, it is important to 
consider the potential impact of the testing 
process itself, including test administration 
and reporting practices, on the test taker. 
Public policy decisions are rarely based 
solely on the results of empirical studies, even 
when the studies have been well done. The 
more expansive and indirect the policy, the 
more likely will it be that other considerations 
will come into play, such as the political and 
economic impact of abandoning, changing, or 
retaining the policy, or the reaction to offering 
rewards or sanctions to institutions. In a politi¬ 
cal climate, tests used in policy settings may be 
subjected to intense and detailed scrutiny. 
When results do not support a favored posi¬ 
tion, attempts may be made to discount the 
appropriateness of the testing procedure, con¬ 
struct, or interpretation. 

It is important that all tests used in pub¬ 
lic evaluation or policy contexts meet the 
standards described in earlier chapters. As 
described in chapter 8, tests are to be adminis¬ 
tered by trained personnel. It is also essential 
that assistance be provided to those responsible 
for interpreting study results to practitioners, 
to the lay public, and to the media. Careful 
communication of the study’s goals, procedures, 
findings, and limitations increases the 
chances that the public’s interpretations will 
be accurate and useful. 

Additional Considerations 

This chapter and its associated standards are 
directed to users of tests in program evaluation 
and policy studies and to the conditions under 
which those studies are usually conducted. 
Other standards documents that are relevant to 
this chapter include The Program Evaluation 
Standards: How to Assess Evaluations of 
Educational Programs, prepared by the Joint 
Committee on Standards for Educational 
Evaluation (2nd ed., Thousand Oaks, CA: 
Sage Publications, 1994), and the Code of Fair 
Testing Practices in Education, prepared by the 
Joint Committee on Testing Practices 
(Washington, DC: Joint Committee on 
Testing Practices, 1988). 
Standards 


Standard 15.1 

When the same test is designed or used 
to serve multiple purposes, evidence of 
technical quality for each purpose should 
be provided. 

Comment: In educational testing, for example, 
it has become common practice to use the 
same test for multiple purposes (e.g., moni¬ 
toring achievement of individual students, 
providing information to assist in instruction¬ 
al planning for individuals or groups of stu¬ 
dents, evaluating schools or districts). No test 
will serve all purposes equally well. Choices in 
test development and evaluation that enhance 
validity for one purpose may diminish validi¬ 
ty for other purposes. Different purposes 
require somewhat different kinds of technical 
evidence, and appropriate evidence of techni¬ 
cal quality for each purpose should be provid¬ 
ed by the test developer. If the test user 
wishes to use the test for a purpose not sup¬ 
ported by the available evidence, it is incum¬ 
bent on the user to provide the necessary 
additional evidence. 

Standard 15.2 

Evidence should be provided of the suitabili¬ 
ty of a test for use in evaluation or policy 
studies, including the relevance of the test to 
the goals of the program or policy under 
study and the suitability of the test for the 
populations involved. 

Comment: Faulty inferences may be made 
when test scores are not sensitive to the 
features of a particular intervention. For 
instance, a test designed for selection may be 
ineffective as a measure of the effects of an 
intervention. It is also important to employ 
tests that are appropriate for the age and 
background of test takers. 


Standard 15.3 

When change or gain scores are used, the 
definition of such scores should be made 
explicit, and their technical qualities should 
be reported. 

Comment: The use of change or gain scores 
presumes that the same test or equivalent 
forms of the test were used and that the test 
(or forms) have not been materially altered 
between administrations. The standard error 
of the difference between scores on pretests 
and posttests, the regression of posttest 
scores on pretest scores, or relevant data 
from other reliable methods for examining 
change, such as those based on structural 
equation modeling, should be reported. 
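
As a worked illustration of the first of these quantities, the sketch below computes the standard error of the difference between a pretest and a posttest score from each test's standard deviation and reliability. It assumes the classical result that the standard error of measurement equals the standard deviation times the square root of one minus the reliability, and that measurement errors on the two occasions are uncorrelated; the function names are hypothetical.

```python
from math import sqrt


def sem(sd, reliability):
    """Standard error of measurement from a score SD and reliability."""
    return sd * sqrt(1 - reliability)


def se_difference(sd_pre, rel_pre, sd_post, rel_post):
    """Standard error of a posttest-minus-pretest difference score.

    Assumes measurement errors on the two occasions are uncorrelated.
    """
    return sqrt(sem(sd_pre, rel_pre) ** 2 + sem(sd_post, rel_post) ** 2)
```

Because the error variances add, the standard error of a difference is larger than that of either score alone, so an observed gain should be judged against this larger value.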

Standard 15.4 

In program evaluation or policy studies, 
investigators should complement test 
results with information from other 
sources to generate defensible conclusions 
based on the interpretation of test results. 

Comment: Descriptions or analyses of such 
variables as client selection criteria, services, 
clients, setting, and resources are often 
needed to provide a comprehensive picture 
of the program or policy under review and 
to aid in the interpretation of test results. 
Performance on indicators other than tests 
is almost always useful and in many cases 
is essential. Examples of other information 
include attrition rates or patterns of partici¬ 
pation. Another source of information 
might be to determine the degree of moti¬ 
vation of the test takers. When individual 
scores are not reported to test takers, it is 
important to determine whether the exam¬ 
inees took the test experience seriously. 
Standard 15.5 

Agencies using tests to conduct program 
evaluations or policy studies, or to monitor 
outcomes, should clearly describe the popu¬ 
lation the program or policy is intended to 
serve and should document the extent to 
which the sample of test takers is represen¬ 
tative of that population. 

Comment: For example, a clinic with a diverse 
client population using testing to assess the 
outcome of a particular treatment may rou¬ 
tinely report the extent of participation by 
subgroups of clients, for instance, those of 
diverse ethnic backgrounds or for whom 
English is a second language. 

Standard 15.6 

When matrix sampling procedures are used 
for program evaluation or population 
descriptions, rules for sampling items and 
test takers should be provided, and reliabili¬ 
ty analyses must take the sampling scheme 
into account. 

Standard 15.7 

When educational testing programs are 
mandated by school, district, state, or other 
authorities, the ways in which test results 
are intended to be used should be clearly 
described. It is the responsibility of those 
who mandate the use of tests to identify 
and monitor their impact and to mini¬ 
mize potential negative consequences. 
Consequences resulting from the uses of 
the test, both intended and unintended, 
should also be examined by the test user. 

Comment: Mandated testing programs are 
often justified in terms of their potential 
benefits for teaching and learning. Concerns 
have been raised about the potential negative 
impact of mandated testing programs, particularly 
when they affect important decisions 
for individuals or institutions. To the 
extent possible, students, parents, and staff 
should be informed of the domains on 
which the students will be tested, the nature 
of the item types, and the standards for mas¬ 
tery. Effort should be made to document the 
provision of instruction in tested content 
and skills, even though it may not be possi¬ 
ble or feasible to determine the specific con¬ 
tent of instruction for every student. An 
example of negative impact is the use of 
strategies to raise performance artificially. 

Standard 15.8 

When it is clearly stated or implied that a 
recommended test use will result in a specif¬ 
ic outcome, the basis for expecting that out¬ 
come should be presented, together with 
relevant evidence. 

Comment: A given claim for the benefits of 
test use, such as improving students’ achieve¬ 
ment, may be supported by logical or theoreti¬ 
cal argument as well as empirical data. Due 
weight should be given to findings in the sci¬ 
entific literature that may be inconsistent 
with the stated claim. 

Standard 15.9 

The integrity of test results should be main¬ 
tained by eliminating practices designed to 
raise test scores without improving performance 
on the construct or domain measured. 

Comment: Such practices may include teach¬ 
ing test items in advance, modifying test 
administration procedures, and discouraging 
or excluding certain test takers from taking 
the test. These practices can lead to spuri¬ 
ously high scores that do not reflect per¬ 
formance on the underlying construct or 
domain of interest. 
Standard 15.10 

Those who have a legitimate interest in an 
assessment should be informed about the 
purposes of testing, how tests will be admin¬ 
istered and scored, how long records will be 
retained, and to whom and under what con¬ 
ditions the records may be released. 

Comment: Those with a legitimate interest 
may include the test takers, their parents or 
guardians, or personnel who may be affected 
by results (teachers, program staff). 

Standard 15.11 

When test results are released to the public 
or to policymakers, those responsible for 
the release should provide and explain any 
supplemental information that will mini¬ 
mize possible misinterpretations of the data. 

Comment: The context and limitations of 
the study should be described, with parti¬ 
cular attention given to methods of causal 
inferences. 

Standard 15.12 

Reports of group differences in average test 
scores should be accompanied by relevant 
contextual information, where possible, to 
enable meaningful interpretation of these 
differences. Where appropriate contextual 
information is not available, users should 
be cautioned against misinterpretation. 

Comment: Observed differences in average 
test scores between groups (e.g., classified by 
gender, race/ethnicity, or geographical region) 
can be influenced, for example, by differences 
in life experiences, training experience, effort, 
instructor quality, or level and type of 
parental support. In education, differences in 
group performance across time may be influ¬ 
enced by changes in the population of those 
tested or changes in their experiences. Users 
should be advised to consider the appropriate 
contextual information and be cautioned 
against misinterpretation. 

Standard 15.13 

Those who mandate testing programs 
should ensure that the individuals who 
interpret the test results to make decisions 
within the school or program context are 
qualified to assume this responsibility and 
proficient in the appropriate methods for 
interpreting test results. 

Comment: When testing programs are used 
as a strategy for guiding interventions or 
instruction, professionals expected to make 
inferences leading to program improvement 
may need assistance in interpreting test 
results for this purpose. 

The interpretation of some test scores is 
sufficiently complex to require that the user 
have relevant psychological training and expe¬ 
rience. Examples of such tests include indi¬ 
vidually administered intelligence tests, 
personality inventories, projective techniques, 
and neuropsychological tests. 
This glossary provides definitions of terms as 
used in this text. For many of the terms, mul¬ 
tiple definitions can be found in the litera¬ 
ture; also, technical usage may differ from 
common usage. 


ability/trait parameter In item response 
theory (IRT), a theoretical value indicating 
the level of a test taker on the ability or trait 
measured by the test; analogous to the concept 
of true score in classical test theory. 

ability testing The use of standardized tests 
to evaluate the current performance of a 
person in some defined domain of cognitive, 
psychomotor, or physical functioning. 

absolute score interpretation The meaning 
of a test score for an individual or an average 
score for a defined group, indicating an individual’s 
or group’s level of performance in 
some defined criterion domain. By contrast, 
see relative score interpretation. 

accommodation See test modification. 

acculturation The process whereby individ¬ 
uals from one culture adopt the characteris¬ 
tics and values of another culture with which 
they have come in contact. 

achievement levels/proficiency levels 
Descriptions of a test taker’s competency in a 
particular area of knowledge or skill, usually 
defined as ordered categories on a continu¬ 
um, often labeled from “basic” to “advanced,” 
or “novice” to “expert,” that constitute broad 
ranges for classifying performance. See cut score. 

achievement testing A test to evaluate the 
extent of knowledge or skill attained by a test 
taker in a content domain in which the test 
taker had received instruction. 


adaptive testing A sequential form of indi¬ 
vidual testing in which successive items, or 
sets of items, in the test are chosen based 
primarily on their psychometric properties 
and content, in relation to the test taker’s 
responses to previous items. 

adjusted validity/reliability coefficient A 
validity or reliability coefficient—most often, 
a product-moment correlation—that has been 
adjusted to offset the effects of differences in 
score variability, criterion variability, or the 
unreliability of test and/or criterion. See 
restriction of range or variability. 

age equivalent The chronological age in a 
defined population for which a given score is 
the median (middle) score. Thus, if children 
10 years and 6 months of age have a median 
score of 17 on a test, the score 17 is said to 
have an age equivalent of 10-6 for that 
population. See grade equivalent. 

alternate forms Two or more versions of a 
test that are considered interchangeable, in 
that they measure the same constructs in the 
same ways, are intended for the same purpos¬ 
es, and are administered using the same direc¬ 
tions. Alternate forms is a generic term used to 
refer to any of three categories. Parallel forms 
have equal raw score means, equal standard 
deviations, equal error structures, and equal 
correlations with other measures for any given 
population. Equivalent forms do not have the 
statistical similarity of parallel forms, but the 
dissimilarities in raw score statistics are com¬ 
pensated for in the conversions to derived 
scores or in form-specific norm tables. 
Comparable forms are highly similar in con¬ 
tent, but the degree of statistical similarity 
has not been demonstrated. See linkage. 

analytic scoring A method of scoring in 
which each critical dimension of performance 
is judged and scored separately, and the resultant 
values are combined for an overall score. In 
some instances, scores on the separate dimen¬ 
sions may also be used in interpreting perform¬ 
ance. See holistic scoring. 

anchor test A common set of items adminis¬ 
tered with each of two or more different 
forms of a test for the purpose of equating 
the scores obtained on these forms. 

assessment Any systematic method of 
obtaining information from tests and other 
sources, used to draw inferences about char¬ 
acteristics of people, objects, or programs. 

attention assessment The process of collect¬ 
ing data and making an appraisal of a person’s 
ability to focus on the relevant stimuli in a 
situation. The assessment may be directed at 
mechanisms involved in arousal, sustained 
attention, selective attention and vigilance, 
or limitation in the capacity to attend to 
incoming information. 

automated narrative report See computer- 
prepared test interpretation. 

back translation A translation of a test, 
which is itself a translation from an original 
test, back into the language of the original
test. The degree to which a back translation 
matches the original test indicates the accura¬ 
cy of the original translation. 

battery A set of tests usually administered as
a unit. The scores on the several tests usually 
are scaled so that they can readily be compared 
or used in combination for decision making. 

bias In a statistical context, a systematic
error in a test score. In discussing test fairness,
bias may refer to construct underrepresentation
or construct-irrelevant components
of test scores that differentially affect the
performance of different groups of test takers.
See predictive bias, construct underrepresentation,
construct irrelevance.

bilingual The characteristic of being relative¬ 
ly proficient in two languages. 

calibration 1. In linking test score scales, the 
process of setting the test score scale, includ¬ 
ing mean, standard deviation, and possibly 
shape of score distribution, so that scores on a
scale have the same relative meaning as scores
on a related scale. 2. In item response theory,
the process of determining the parameters of 
the response function for an item. 

certification A voluntary process, often 
national in scope, by which individuals who 
have been certified have demonstrated some 
level of knowledge and skill in an occupation. 
See licensing, credentialing. 

classical test theory A psychometric theory 
based on the view that an individual’s 
observed score on a test is the sum of a true 
score component for the test taker, plus an
independent measurement error component. 

classification accuracy The degree to which
neither false positive nor false negative cate¬ 
gorizations and diagnoses occur when a test 
is used to classify an individual or event. 
See sensitivity and specificity. 
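
The counting behind this definition can be sketched as follows. This is a minimal illustration, not a prescribed procedure: the function name is hypothetical, and the sensitivity and specificity formulas are the conventional ones (true positives over all actual positives, true negatives over all actual negatives), which this glossary references but does not spell out here.

```python
def classification_summary(predicted, actual):
    """Summarize agreement between test-based classifications and true status.

    predicted, actual: equal-length sequences of booleans, where True means
    "meets the criteria for inclusion in the group".
    """
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    true_neg = sum(1 for p, a in pairs if not p and not a)
    false_pos = sum(1 for p, a in pairs if p and not a)   # classified in, truly out
    false_neg = sum(1 for p, a in pairs if not p and a)   # classified out, truly in
    return {
        "accuracy": (true_pos + true_neg) / len(pairs),
        "sensitivity": true_pos / (true_pos + false_neg),
        "specificity": true_neg / (true_neg + false_pos),
        "false_positives": false_pos,
        "false_negatives": false_neg,
    }
```

Accuracy is highest when both error counts are zero, which matches the definition above: neither false positive nor false negative categorizations occur.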

coaching Planned short-term instructional 
activities in which prospective test takers par¬ 
ticipate prior to the test administration for 
the primary purpose of improving their test 
scores. Coaching typically includes simple 
practice, instruction on test-taking strategies, 
and related activities. Activities that approxi¬ 
mate the instruction provided by regular 
school curricula or training programs are 
not typically referred to as coaching. 

coefficient alpha An internal consistency 
reliability coefficient based on the number
of parts into which the test is partitioned
(e.g., items, subtests, or raters), the interrelationships
of the parts, and the total test score
variance. Also called Cronbach’s alpha and, 
for dichotomous items, KR 20. 
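
A small sketch of how the three ingredients named above (number of parts, part variances, total score variance) combine, using the usual formula alpha = k/(k-1) * (1 - sum of part variances / total variance). The function name and the data layout (one row of part scores per test taker) are assumptions made for illustration.

```python
from statistics import pvariance

def coefficient_alpha(scores):
    """Coefficient alpha for a table of part scores.

    scores: list of rows, one per test taker; each row lists that person's
    score on each of the k parts (items, subtests, or raters).
    """
    k = len(scores[0])
    # Variance of each part across test takers.
    part_variances = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of the total score across test takers.
    total_variance = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(part_variances) / total_variance)

# Four test takers, three dichotomous items (so this is also KR 20).
example = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
alpha = coefficient_alpha(example)   # 0.6 for this table
```

Population variances are used throughout; using sample variances consistently gives the same ratio.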

cognitive assessment The process of system¬ 
atically gathering test scores and related data 
in order to make judgments about an individ¬ 
ual’s ability to perform various mental activi¬ 
ties involved in the processing, acquisition, 
retention, conceptualization, and organization 
of sensory, perceptual, verbal, spatial, and 
psychomotor information. 

composite score A score that combines sev¬ 
eral scores according to a specified formula. 

computer-administered test A test adminis¬ 
tered by a computer. Questions appear on a 
computer-produced display, and the test 
taker answers by using a keyboard, “mouse” 
or other similar response device. 

computer-based mastery test An adaptive 
test administered by computer that indicates 
whether or not the test taker has mastered a
certain domain. The test is not designed to 
provide scores indicating degree of mastery, 
but only whether the test performance was 
above or below some specified level. Thus 
a computer-based mastery test is not simply 
a mastery test given by computer. See mastery test.

computer-based test See computer-adminis¬ 
tered test. 

computer-generated test interpretation See
computer-prepared test interpretation.

computer-prepared test interpretation A 
programmed, computer-prepared interpreta¬ 
tion of an examinee’s test results, based on 
empirical data and/or expert judgment. 


computerized adaptive test An adaptive test 
administered by computer. See adaptive testing. 

conditional measurement error variance 
The variance of measurement errors that
affect the scores of examinees at a specified
test score level; the square of the conditional
standard error of measurement. 

conditional standard error of measurement 
The standard deviation of measurement 
errors that affect the scores of examinees at 
a specified test score level.

confidence interval An interval between two 
values on a score scale within which, with spec¬ 
ified probability, a score or parameter of interest 
lies. The term is also used in these standards to 
designate Bayesian credibility intervals that 
define the probability that the unknown 
parameter falls in the specified interval.

configural scoring rule A rule for scoring a 
set of two or more elements (such as items or
subtests) in which the score depends on a par¬ 
ticular pattern of responses to the elements. 

construct The concept or the characteristic
that a test is designed to measure. 

construct domain The set of interrelated 
attributes (e.g., behaviors, attitudes, values) that 
are included under a construct’s label. A test 
typically samples from this construct domain. 

construct equivalence 1. The extent to which 
the construct measured by one test is essentially 
the same as the construct measured by another 
test. 2. The degree to which a construct measured 
by a test in one cultural or linguistic group is 
comparable to the construct measured by the 
same test in a different cultural or linguistic group.

construct irrelevance The extent to which 
test scores are influenced by factors that are 
irrelevant to the construct that the test is
intended to measure. Such extraneous factors
distort the meaning of test scores from what
is implied in the proposed interpretation.

construct underrepresentation The extent 
to which a test fails to capture important 
aspects of the construct that the test is 
intended to measure. In this situation, the 
meaning of test scores is narrower than the 
proposed interpretation implies. 

construct validity A term used to indicate 
that the test scores are to be interpreted as 
indicating the test taker’s standing on the
psychological construct measured by the test. 
A construct is a theoretical variable inferred 
from multiple types of evidence, which might 
include the interrelations of the test scores 
with other variables, internal test structure, 
observations of response processes, as well as 
the content of the test. In the current stan¬ 
dards, all test scores are viewed as measures 
of some construct, so the phrase is redundant 
with validity. The validity argument establish¬ 
es the construct validity of a test. See con¬ 
struct, validity argument. 

constructed response item An exercise 
for which examinees must create their own 
responses or products rather than choose a 
response from an enumerated set. Short- 
answer items require a few words or a num¬ 
ber as an answer, whereas extended-response 
items require at least a few sentences. 

content domain The set of behaviors, 
knowledge, skills, abilities, attitudes or other 
characteristics to be measured by a test, repre¬ 
sented in a detailed specification, and often 
organized into categories by which items are 
classified.

content standard A statement of a broad 
goal describing expectations for students in 
a subject matter at a particular grade or at 
the completion of a level of schooling.

content validity A term used in the 1974
Standards to refer to a kind or aspect of validity
that was “required when the test user wishes
to estimate how an individual performs in
the universe of situations the test is intended
to represent” (p. 28). In the 1985 Standards,
the term was changed to content-related
evidence, emphasizing that it referred to one
type of evidence within a unitary conception
of validity. In the current Standards, this type
of evidence is characterized as “evidence based
on test content.”

convergent evidence Evidence based on the 
relationship between test scores and other 
measures of the same construct. 

credentialing Granting to a person, by some 
authority, a credential, such as a certificate, 
license, or diploma, that signifies an acceptable
level of performance in some domain of
knowledge or activity. 

criterion domain The construct domain of 
a variable used as a criterion. See construct 
domain. 

criterion-referenced score interpretation See
criterion-referenced test.

criterion-referenced test A test that allows 
its users to make score interpretations in rela¬ 
tion to a functional performance level, as dis¬ 
tinguished from those interpretations that are 
made in relation to the performance of oth¬ 
ers. Examples of criterion-referenced interpre¬ 
tations include comparison to cut scores, 
interpretations based on expectancy tables,
and domain-referenced score interpretations. 

cross-validation A procedure in which a 
scoring system or set of weights for predicting
performance, derived from one sample, is 
applied to a second sample in order to inves¬ 
tigate the stability of prediction of the scoring 
system or weights.

cut score A specified point on a score scale,
such that scores at or above that point are 
interpreted or acted upon differently from 
scores below that point. See performance 
standard. 

derived score A score to which raw scores 
are converted by numerical transformation 
(e.g., conversion of raw scores to percentile 
ranks or standard scores). 

diagnostic and intervention decisions 
Decisions based upon inferences derived from 
psychological test scores as part of an assess¬ 
ment of an individual that lead to placing the 
individual in one or more categories. See also 
intervention planning. 

differential item functioning A statistical 
property of a test item in which different 
groups of test takers who have the same total 
test score have different average item scores
or, in some cases, different rates of choosing 
various item options. Also known as DIF. 
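
The comparison described above, average item scores for groups matched on total test score, can be sketched as below. This is only a descriptive tabulation under assumed names (`conditional_item_means`, `dif_gaps`, and the record layout are hypothetical); it is not one of the formal DIF statistics used operationally.

```python
from collections import defaultdict

def conditional_item_means(records):
    """Average item score for each (total score, group) cell.

    records: iterable of (group, total_score, item_score) tuples.
    """
    cells = defaultdict(list)
    for group, total, item in records:
        cells[(total, group)].append(item)
    return {key: sum(values) / len(values) for key, values in cells.items()}

def dif_gaps(records, reference, focal):
    """Reference-minus-focal item mean at each total score both groups share."""
    means = conditional_item_means(records)
    totals = sorted({total for total, _ in means})
    return {
        total: means[(total, reference)] - means[(total, focal)]
        for total in totals
        if (total, reference) in means and (total, focal) in means
    }

records = [
    ("A", 5, 1), ("A", 5, 1), ("B", 5, 1), ("B", 5, 0),
    ("A", 3, 0), ("B", 3, 0),
]
gaps = dif_gaps(records, reference="A", focal="B")   # {3: 0.0, 5: 0.5}
```

A nonzero gap at a given total score means that test takers with the same total score differ, by group, in their average score on the item.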

discriminant evidence Evidence based on 
the relationship between test scores and 
measures of different constructs. 

documentation The body of literature (e.g., 
test manuals, manual supplements, research 
reports, publications, user’s guides, etc.) 
made available by publishers and test authors 
to support test use. 

domain sampling The process of selecting 
test items to represent a specified universe of 
performance. 

empirical evidence Evidence based on some 
form of data, as opposed to that based on logic 
or theory. As used here, the term does not 
specify the type of evidence; this is in contrast 
to some settings where the term is equated 
with criterion-related evidence of validity. 


equated forms Two or more test forms con¬ 
structed to cover the same explicit content, to 
conform to the same statistical specifications, 
and to be administered under identical procedures
(alternate forms); through statistical
adjustments, the scores on the alternate forms 
share a common scale. 

equating Putting two or more essentially par¬ 
allel tests on a common scale. See alternate forms. 

equivalent forms See alternate forms. 

error of measurement The difference 
between an observed score and the corre¬ 
sponding true score or proficiency. See stan¬ 
dard error of measurement and true score. 

factor 1. Any variable, real or hypothetical, 
that is an aspect of a concept or construct. 2. 
In measurement theory, a statistical dimension 
defined by a factor analysis. See factor analysis. 

factor analysis Any of several statistical 
methods of describing the interrelationships 
of a set of variables by statistically deriving 
new variables, called factors, that are fewer in 
number than the original set of variables. 

factorial structure 1. The set of factors
obtained in a factor analysis. 2. Technically, the 
correlation of each factor with each of the origi¬ 
nal variables from which the factors are derived. 

fairness In testing, the principle that every
test taker should be assessed in an equitable 
way. See chapter 7. 

false negative In classification, diagnosis, or 
selection, an error in which an individual is 
assessed or predicted not to meet the criteria 
for inclusion in a particular group but in 
truth does (or would) meet these criteria. See 
sensitivity and specificity.

false positive In classification, diagnosis, or
selection, an error in which an individual is 
assessed or predicted to meet the criteria for 
inclusion in a particular group but in truth 
does not (or would not) meet these criteria. 
See sensitivity and specificity. 

field test A test administration used to check 
the adequacy of testing procedures, generally 
including test administration, test respond¬ 
ing, test scoring, and test reporting. A field 
test is generally more extensive than a pilot 
test. See pilot test. 

flag An indicator attached to a test score, a 
test item, or other entity to indicate a special 
status. A flagged test score generally signifies 
a score obtained in a modified, nonstandard 
test administration. A flagged test item gen¬ 
erally signifies an item with undesirable 
characteristics, such as excessive differential 
item functioning. 

functional equivalence In evaluating test 
translations, the degree to which similar activi¬ 
ties or behaviors have the same functions in 
different cultural or linguistic groups. 

gain score In testing, the difference between 
two scores obtained by a test taker on the same 
test or two equated tests taken on different 
occasions, often before and after some treatment. 

generalizability coefficient A reliability 
index encompassing one or more independ¬ 
ent sources of error. It is formed as the ratio 
of (a) the sum of variances that are considered 
components of test score variance in the setting
under study to (b) the foregoing sum
plus the weighted sum of variances attributa¬ 
ble to various error sources in this setting. 
Such indices, which arise from the applica¬ 
tion of generalizability theory, are typically 
interpreted in the same manner as reliability 
coefficients. See generalizability theory. 
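
The ratio described in (a) and (b) can be written out directly. This is a sketch under stated assumptions: the function name is hypothetical, and the variance components and weights are taken as given inputs (in practice they would come from an analysis of variance, which is not shown).

```python
def generalizability_coefficient(score_variances, error_variances, error_weights):
    """Ratio of score variance to score variance plus weighted error variance.

    score_variances: variance components counted as test score variance.
    error_variances: variance components attributable to error sources.
    error_weights:   one weight per error source (for example, 1 / number of
                     conditions sampled; the weights here are illustrative).
    """
    score_part = sum(score_variances)
    error_part = sum(w * v for w, v in zip(error_weights, error_variances))
    return score_part / (score_part + error_part)

# One score component of 4.0 and two error sources.
coefficient = generalizability_coefficient([4.0], [2.0, 1.0], [0.5, 1.0])
```

With these illustrative inputs the weighted error sum is 2.0, so the coefficient is 4.0 / 6.0, about 0.67. Larger error components or larger weights lower the coefficient.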


generalizability theory An extension of classical
reliability theory and methodology in
which the magnitudes of errors from specified 
sources are estimated through the use of one 
or another experimental design, and the 
application of the statistical techniques of the 
analysis of variance. The analysis indicates the 
generalizability of scores beyond the specific 
sample of items, persons, and observational 
conditions that were studied. 

grade equivalent The school grade level for 
a given population for which a given score is 
the median score in that population. See age 
equivalent. 

high-stakes test A test used to provide results 
that have important, direct consequences for 
examinees, programs, or institutions involved 
in the testing. 

holistic scoring A method of obtaining a 
score on a test, or a test item, based on a 
judgment of overall performance using speci¬ 
fied criteria. See analytic scoring. 

informed consent The agreement of a person,
or that person’s legal representative, for
some procedure to be performed on or by the 
individual, such as taking a test or completing 
a questionnaire. The agreement, which is usually
written, is made after the nature, possible
effects, and use of the procedure has been 

intelligence test A psychological or educa¬ 
tional test designed to measure an individual's 
level of cognitive functioning in accord with 
some recognized theory of intelligence. 

internal consistency coefficient An index 
of the reliability of test scores derived from 
the statistical interrelationships of responses
among item responses or scores on separate 
parts of a test.

internal structure In test analysis, the factorial
structure of item responses or subscales
of a test. See factorial structure. 

inter-rater agreement The consistency with 
which two or more judges rate the work or 
performance of test takers; sometimes referred 
to as inter-rater reliability.

intervention planning The activity of a 
practitioner that involves the development 
of a treatment protocol. 

inventory A questionnaire or checklist, usu¬ 
ally in the form of a self-report, that elicits 
information about an individual’s personal 
opinions, interests, attitudes, preferences, per¬ 
sonality characteristics, motivations, and typi¬ 
cal reactions to situations and problems. 

item A statement, question, exercise, or task 
on a test for which the test taker is to select 
or construct a response, or perform a task. 
See item prompt. 

item characteristic curve A mathematical 
function relating the probability of a certain 
item response, usually a correct response, to 
the level of the attribute measured by the
item. Also called item response curve, or
item response function, or ICC.

item pool The aggregate of items from 
which a test or test scale’s items are selected 
during test development, or the total set of
items from which a particular test is selected
for a test taker during adaptive testing.

item prompt The question, stimulus, or 
instructions that direct the efforts of exami¬ 
nees in formulating their responses to a
constructed-response exercise.

item response theory (IRT) A mathematical 
model of the relationship between performance
on a test item and the test taker’s level of
performance on a scale of the ability, trait, or
proficiency being measured, usually denoted
as θ. In the case of items scored 0/1 (incorrect/correct
response) the model describes the
relationship between θ and the item mean score
(P) for test takers at level θ, over the range of
permissible values of θ. In most applications,
the mathematical function relating P to θ is
assumed to be a logistic function that closely 
resembles the cumulative normal distribution. 
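
A minimal sketch of such a function for a 0/1 item, compared with the cumulative normal it is said to resemble. The discrimination parameter `a`, the difficulty parameter `b`, and the scaling constant 1.7 are conventional choices assumed for illustration; none of them is specified in this entry.

```python
import math
from statistics import NormalDist

def logistic_icc(theta, a=1.0, b=0.0, scale=1.7):
    """Probability of a correct response at proficiency theta (logistic form).

    a: item discrimination, b: item difficulty, scale: constant that brings
    the logistic curve close to the normal ogive (all assumed here).
    """
    return 1.0 / (1.0 + math.exp(-scale * a * (theta - b)))

def normal_ogive_icc(theta, a=1.0, b=0.0):
    """Same item under the cumulative normal (normal ogive) form."""
    return NormalDist().cdf(a * (theta - b))

# Largest gap between the two curves over a grid from -4.0 to 4.0.
grid = [i / 10 for i in range(-40, 41)]
largest_gap = max(abs(logistic_icc(t) - normal_ogive_icc(t)) for t in grid)
```

Over this grid the two curves stay within about 0.01 of each other, and both give a probability of 0.5 when the proficiency equals the item difficulty.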

job analysis A general term referring to the 
investigation of positions or job classes to 
obtain descriptive information about job 
duties and tasks, responsibilities, necessary 
worker characteristics (e.g., knowledge, skills,
and abilities), working conditions, and/or 
other aspects of the work. 

job performance measurement The measure¬ 
ment of an incumbent’s performance of a job. 
This may include a job sample test, an assess¬ 
ment of job knowledge, and possibly ratings of 
the incumbent’s actual performance on the job.

job sample test A test of the ability of an 
individual to perform the tasks of which the 
job is comprised. 

licensing The granting, usually by a government
agency, of an authorization or legal
permission to practice an occupation or pro¬ 
fession. See also certification, credentialing. 

linkage The result of placing two or more 
tests on the same scale, so that scores can be 
used interchangeably. Several linking methods 
are used. See equating, calibration, moderation,
projection, and alternate forms.

literature In this document, a term denoting 
accessible reports of research, such as books, 
articles published in professional journals, 
technical reports, and accessible versions of 
papers presented at professional meetings.

local evidence Evidence (usually related to
reliability or validity) collected for a specific 
set of test takers in a single institution or at 
a specific location. 

local norms Norms by which test scores are 
referred to a specific, limited reference popula¬ 
tion of particular interest to the test user 
(e.g., locale, organization, or institution); 
local norms are not intended as representative 
of populations beyond that setting. 

local setting The organization or institution 
where a test is used. 

low-stakes test A test used to provide results 
that have only minor or indirect consequences 
for examinees, programs, or institutions 
involved in the testing. 

mandated tests Tests that are administered 
because of a mandate from an external authority. 

mastery test 1. A criterion-referenced test 
designed to indicate the extent to which the 
test taker has mastered some domain of knowl¬ 
edge or skill. Mastery is generally indicated by 
attaining a passing score or cut score. 2. In 
some technical use, a test designed to indicate 
whether a test taker has or has not attained a 
prescribed level of mastery of a domain. See 
cut score, computer-based mastery test. 

matrix sampling A measurement format in 
which a large set of test items is organized 
into a number of relatively short item sets, 
each of which is randomly assigned to a sub¬ 
sample of test takers, thereby avoiding the 
need to administer all items to all examinees 
in a program evaluation. 
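
One way to picture the assignment step is sketched below. The function name, the round-robin split of items into sets, and the seed parameter are assumptions for illustration; an operational design would usually balance content across the item sets as well.

```python
import random

def matrix_sample(items, test_takers, n_sets, seed=None):
    """Split items into n_sets short sets and randomly assign one per person.

    Returns a dict mapping each test taker to the item set they receive.
    """
    rng = random.Random(seed)
    # Deal the items into n_sets relatively short sets.
    item_sets = [items[i::n_sets] for i in range(n_sets)]
    # Shuffle the test takers so that each set goes to a random subsample.
    shuffled = list(test_takers)
    rng.shuffle(shuffled)
    return {person: item_sets[i % n_sets] for i, person in enumerate(shuffled)}

assignment = matrix_sample(list(range(10)), list("abcdef"), n_sets=3, seed=1)
```

Here no one answers all ten items, yet every item is administered to some subsample, so program-level results can still cover the full item set.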

meta-analysis A statistical method of research 
in which the results from several independent, 
comparable studies are combined to determine 
the size of an overall effect or the degree of 
relationship between two variables.

moderation In test linking, the term moderation,
used without a modifier, usually signifies
statistical moderation, which is the adjustment 
of the score scale of one test, usually by setting 
the mean and standard deviation of one set of 
test scores to be equal to the mean and standard 
deviation of another distribution of test scores. 
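
The adjustment described above amounts to a linear rescaling. The sketch below assumes population standard deviations and a hypothetical function name; it illustrates the idea rather than any particular program's procedure.

```python
from statistics import mean, pstdev

def moderate(scores, target_scores):
    """Rescale scores so their mean and SD equal those of target_scores."""
    m_x, s_x = mean(scores), pstdev(scores)
    m_y, s_y = mean(target_scores), pstdev(target_scores)
    # Standardize each score, then express it on the target distribution.
    return [m_y + s_y * (x - m_x) / s_x for x in scores]

adjusted = moderate([10, 20, 30], [50, 60, 70, 80])
```

After the adjustment, the first set of scores has the same mean and standard deviation as the second, while the rank order of the adjusted scores is unchanged.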

moderator variable In regression analysis, a 
variable that serves to explain, at least in part, 
the correlation of two other variables. 

modification See test modification. 

neuropsychodiagnosis Classification or 
description of inferred central nervous sys¬ 
tem status on the basis of neuropsychological 
assessment. 

neuropsychological assessment A specialized 
type of psychological assessment of normal or 
pathological processes affecting the central 
nervous system and the resulting psychological 
and behavioral functions or dysfunctions. 

norm-referenced test interpretation A score 
interpretation based on a comparison of a test
taker’s performance to the performance of
other people in a specified reference population.
See criterion-referenced test.

normalized standard score A derived test 
score in which a numerical transformation 
has been chosen so that the score distribution 
closely approximates a normal distribution, 
for some specific population. 
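
One common way to obtain such a transformation is to convert each raw score to a percentile rank and then to the normal deviate with that cumulative proportion. The sketch below assumes the midpoint percentile rank convention and a reporting scale with mean 50 and standard deviation 10; both are illustrative choices, and the function name is hypothetical.

```python
from statistics import NormalDist

def normalized_standard_scores(raw_scores, target_mean=50.0, target_sd=10.0):
    """Map raw scores to a scale whose distribution is approximately normal."""
    n = len(raw_scores)
    standard_normal = NormalDist()
    result = []
    for x in raw_scores:
        below = sum(1 for s in raw_scores if s < x)
        at = sum(1 for s in raw_scores if s == x)
        # Proportion below the score plus half of those at the score.
        proportion = (below + 0.5 * at) / n
        z = standard_normal.inv_cdf(proportion)
        result.append(target_mean + target_sd * z)
    return result

scaled = normalized_standard_scores([3, 7, 20])
```

Because the transformation depends only on rank within the group, the unevenly spaced raw scores 3, 7, and 20 receive symmetric scaled scores around 50.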

norms Statistics or tabular data that summa¬ 
rize the distribution of test performance for 
one or more specified groups, such as test tak¬ 
ers of various ages or grades. Norms are usually 
designed to represent some larger population, 
such as test takers throughout the country. The
group of examinees represented by the norms is 
referred to as the reference population.

operational use The actual use of a test,
after initial test development has been completed,
to inform an interpretation, decision,
or action based, in part, upon test scores. 

outcome evaluation An evaluation of the 
efficacy of an intervention. 

parallel forms See alternate forms. 

percentile The score on a test below which a 
given percentage of scores fall. 

percentile rank Most commonly, the per¬ 
centage of scores in a specified distribution 
that fall below the point at which a given 
score lies. Sometimes the percentage is defined 
to include scores that fall at the point; some¬ 
times the percentage is defined to include half 
of the scores at the point. 
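
The three conventions mentioned above differ only in how scores tied with the given score are counted. A small sketch, with hypothetical function and convention names:

```python
def percentile_rank(score, distribution, convention="below"):
    """Percentile rank of score within distribution under a named convention.

    "below":       percentage of scores strictly below the score
    "at_or_below": also counts scores equal to the score
    "midpoint":    counts half of the scores equal to the score
    """
    below = sum(1 for s in distribution if s < score)
    at = sum(1 for s in distribution if s == score)
    if convention == "below":
        count = below
    elif convention == "at_or_below":
        count = below + at
    elif convention == "midpoint":
        count = below + 0.5 * at
    else:
        raise ValueError(f"unknown convention: {convention}")
    return 100.0 * count / len(distribution)

scores = [1, 2, 2, 3, 4]
ranks = {c: percentile_rank(2, scores, c)
         for c in ("below", "at_or_below", "midpoint")}
# ranks == {"below": 20.0, "at_or_below": 60.0, "midpoint": 40.0}
```

For a score of 2 in this small distribution, the same score has a percentile rank of 20, 60, or 40 depending on the convention, which is why reports should state which definition is in use.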

performance assessments Product- and 
behavior-based measurements based on set¬ 
tings designed to emulate real-life contexts 
or conditions in which specific knowledge 
or skills are actually applied. 

performance standard 1. An objective defi¬ 
nition of a certain level of performance in 
some domain in terms of a cut score or a
range of scores on the score scale of a test 
measuring proficiency in that domain. 2. A 
statement or description of a set of opera¬ 
tional tasks exemplifying a level of perform¬ 
ance associated with a more general content 
standard; the statement may be used to guide 
judgments about the location of a cut score 
on a score scale. The term often implies a 
desired level of performance. See cut score. 

personality inventory An inventory that 
measures one or more characteristics that are 
regarded generally as psychological attributes 
or interpersonal proclivities or skills. 


pilot test A test administered to a sample of 
test takers to try out some aspects of the test
or test items, such as instructions, time limits, 
item response formats, or item response 
options. See field test. 

policy The principles, plan, or procedures 
established by an agency, institution, organi¬ 
zation, or government, generally with the 
intent of reaching a long-term goal. 

portfolio In assessment, a systematic collec¬ 
tion of educational or work products that 
have been compiled or accumulated over 
time, according to a specific set of principles. 

precision of measurement A general term 
that refers to a measure’s sensitivity to measurement
error. See standard error of measurement,
error of measurement.

practice analysis A general term referring to 
the investigation of a certain work position, or 
profession, to obtain descriptive information 
about the activities and responsibilities of the 
position and about the knowledge, skills, and 
abilities needed to engage in the work of the 
position. The concept is essentially the same as 
a job analysis but is generally preferred for pro¬ 
fessional occupations involving a great deal of 
individual decision making. See job analysis. 

predictive bias The systematic under- or over¬ 
prediction of criterion performance for people 
belonging to groups differentiated by character¬ 
istics not relevant to criterion performance. 

predictive validity A term used in the 1974 
Standards to refer to a type of “criterion-related 
validity” that applies “when one wishes to infer
from a test score an individual’s most probable
standing on some other variable called a criterion”
(p. 26). In the 1985 Standards, the term
criterion-related validity was changed to criterion-related
evidence, emphasizing that it referred
to one type of evidence within a unitary conception
of validity. The current document refers
to “evidence based on relations to other variables”
that include “test-criterion relationships.”
Predictive evidence indicates how accurately 
test data can predict criterion scores that are 
obtained at a later time. 

program evaluation The collection and syn¬ 
thesis of systematic evidence about the use, 
operation, and effects of some planned set of 
procedures. 

program norms See user norms. 

projection In test scaling, a method of linking 
in which scores on one test (X) are used to pre¬ 
dict scores on another test (Y). The projected Y 
score is the average Y score for all persons with 
a given X score. Like regression, the projection 
of test Y onto test X is different from the pro¬ 
jection of test X onto test Y. See linkage. 
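
The asymmetry noted above can be seen in a small tabulation. The sketch takes the definition literally (the projected score is the average of the other test's scores among persons with a given score); the function name and the data are invented for illustration.

```python
from collections import defaultdict
from statistics import mean

def project(pairs):
    """Map each score on the first test to the average paired second score.

    pairs: iterable of (first_test_score, second_test_score) per person.
    """
    by_first = defaultdict(list)
    for first, second in pairs:
        by_first[first].append(second)
    return {first: mean(seconds) for first, seconds in by_first.items()}

xy = [(1, 2), (1, 4), (2, 4), (2, 6)]        # (X, Y) for four persons
x_to_y = project(xy)                          # {1: 3, 2: 5}
y_to_x = project([(y, x) for x, y in xy])     # {2: 1, 4: 1.5, 6: 2}
```

The two tables are not inverses of each other: a person with X = 1 is projected to Y = 3, but persons with Y = 4 are projected to X = 1.5, a value that neither X score maps back from.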

proposed interpretation A summary, or a 
set of illustrations, of the intended meaning 
of test scores, based on the construct(s) or 
concept(s) the test is designed to measure. 

protocol A record of events. A test protocol 
will usually consist of the test record and test 

psychodiagnosis Formalization or classification 
of functional mental health status based on psy¬ 
chological assessment. See neuropsychodiagnosis. 

psychological assessment A comprehensive 
examination of psychological functioning that 
involves collecting, evaluating, and integrating 
test results and collateral information, and report¬ 
ing information about an individual. Various 
methods may be used to acquire information 
during a psychological assessment: administer¬ 
ing, scoring and interpreting tests and invento¬ 
ries; behavioral observation; client and third-party 
interviews; analysis of prior educational, occu¬ 
pational, medical, and psychological records. 


psychological testing Any procedure that 
involves the use of tests or inventories to 
assess particular psychological characteristics
of an individual. 

random error An unsystematic error; a quan¬ 
tity (often observed indirectly) that appears to 
have no relationship to any other variable. 

random sample See sample. 

raw score The unadjusted score on a test, 
often determined by counting the number of 
correct answers, but more generally a sum or 
other combination of item scores. In item
response theory, the estimate of test taker
proficiency, usually symbolized θ, is analogous
to a raw score although, unlike a raw score,
its scaling is not arbitrary.

reference population The population of test 
takers represented by test norms. The sample 
on which the test norms are based must per¬ 
mit accurate estimation of the test score dis¬ 
tribution for the reference population. The 
reference population may be defined in terms 
of examinee age, grade, or clinical status at 
time of testing, or other characteristics. 

relative score interpretation The meaning 
of the test score for an individual, or the aver¬ 
age score for a definable group, derived from 
the rank of the score or average within one or 
more reference distributions of scores. See 
absolute score interpretation. 

reliability The degree to which test scores
for a group of test takers are consistent over 
repeated applications of a measurement pro¬ 
cedure and hence are inferred to be depend¬ 
able, and repeatable for an individual test 
taker; the degree to which scores are free of 
errors of measurement for a given group. 
See generalizability theory.

reliability coefficient A unit-free indicator
that reflects the degree to which scores are
free of measurement error. The indicator
resembles (or is) a product-moment correlation.
In classical test theory, the term represents
the ratio of true score variance to
observed score variance for a particular examinee
population. The conditions under which
the coefficient is estimated may involve variation
in test forms, measurement occasions,
raters, scorers, or clinicians, and may entail
multiple examinee products or performances.
These and other variations in conditions give
rise to qualifying adjectives, such as alternate-form
reliability, internal consistency
reliability, test-retest reliability, etc. See 
generalizability theory. 
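
The classical test theory ratio can be illustrated with made-up numbers. True scores are never observed in practice, so the sketch below is purely conceptual: it assumes known true scores and errors that are uncorrelated with them, and the function name is hypothetical.

```python
from statistics import pvariance

def reliability_from_components(true_scores, errors):
    """Ratio of true score variance to observed score variance.

    Observed scores are formed as true score plus error for each examinee.
    """
    observed = [t + e for t, e in zip(true_scores, errors)]
    return pvariance(true_scores) / pvariance(observed)

true_scores = [10, 20, 30, 40]
errors = [2, -2, -2, 2]          # mean zero, uncorrelated with the true scores
coefficient = reliability_from_components(true_scores, errors)
```

Here the true score variance is 125 and the observed score variance is 129, so the coefficient is about 0.97. Larger errors inflate the observed variance and pull the ratio down.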

response bias A test taker’s tendency to
respond in a particular way or style to items
on a test (i.e., acquiescence, social desirability,
the tendency to choose “true” on a true-false
test) that yields systematic, construct-irrele- 


response process A component, usually 
hypothetical, of a cognitive account of some 
behavior, such as making an item response. 

response protocol A record of the responses 
given by a test taker to a particular test. 

restriction of range or variability Reduction 
in the observed score variance of an examinee 
sample, compared to the variance of the entire
examinee population, as a consequence of con¬ 
straints on the process of sampling examinees. 
See adjusted validity/reliability coefficient. 

rubric See scoring rubric. 

sample A selection of a specified number of
entities called sampling units (test takers, items,
etc.) from a larger specified set of possible
entities, called the population. A random
sample is a selection according to a random
process, with the selection of each entity in no
way dependent on the selection of other entities.
A stratified random sample is a set of random
samples, each of a specified size, from
several different sets, which are viewed as strata
of the population.
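
A stratified random sample as defined in this entry can be sketched as follows. This is a minimal illustration, not part of the glossary; the function name, the stratum labels, and the sample sizes are hypothetical, and the sketch draws without replacement within each stratum.

```python
import random


def stratified_sample(strata, sizes, seed=None):
    """Draw one simple random sample, of the specified size, from each stratum.

    strata: dict mapping a stratum label to a list of sampling units.
    sizes:  dict mapping the same labels to the number of units to draw.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return {label: rng.sample(units, sizes[label])
            for label, units in strata.items()}


# Hypothetical population of test takers split into two strata.
population = {
    "grade_4": [f"g4_{i}" for i in range(100)],
    "grade_8": [f"g8_{i}" for i in range(60)],
}
sample = stratified_sample(population, {"grade_4": 10, "grade_8": 6}, seed=1)
print({label: len(units) for label, units in sample.items()})
```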

scale 1. The system of numbers, and their 
units, by which a value is reported on some 
dimension of measurement. Length can be 
reported in the English system of feet and 
inches or in the metric system of meters and 
centimeters. 2. In testing, scale sometimes
refers to the set of items or subtests used in
the measurement and is distinguished from a
test in the type of characteristic being measured.
One speaks of a test of verbal ability,
but a scale of extroversion-introversion. 

scale score See derived score. 

scaling The process of creating a scale or a
scale score. Scaling may enhance test score
interpretation by placing scores from different
tests or test forms onto a common scale or by
producing scale scores designed to support
criterion-referenced or norm-referenced score
interpretations. See scale.

score Any specific number resulting from 
the assessment of an individual; a generic 
term applied for convenience to such diverse 
measures as test scores, estimates of latent 
variables, production counts, absence records, 
course grades, ratings, and so forth.

scoring formula The formula by which the 
raw score on a test is obtained. The simplest 
scoring formula is “raw score equals number 
correct.” Other formulas differentially weight 
item responses. For example, in an attempt to 
correct for guessing or nonresponse, zero 
weights may be assigned to nonresponses and 
negative weights to incorrect responses. 
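
One widely used formula of this kind is the conventional correction for guessing, in which each wrong answer on a k-option item costs 1/(k - 1) of a point and omitted items carry zero weight. The sketch below is illustrative only; the glossary does not prescribe this particular formula, and the function name and counts are assumptions.

```python
def formula_score(num_right: int, num_wrong: int, num_options: int) -> float:
    """Raw score with the conventional correction for guessing.

    Each correct response has weight +1, each incorrect response has
    weight -1/(num_options - 1), and nonresponses have weight 0.
    """
    if num_options < 2:
        raise ValueError("items need at least two response options")
    return num_right - num_wrong / (num_options - 1)


# Hypothetical 60-item, four-option test: 40 right, 12 wrong, 8 omitted.
print(formula_score(num_right=40, num_wrong=12, num_options=4))
```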

scoring rubric The established criteria,
including rules, principles, and illustrations,
used in scoring responses to individual items
and clusters of items. The term usually refers
to the scoring procedures for assessment tasks
that do not provide enumerated responses
from which test takers make a choice. Scoring
rubrics vary in the degree of judgment
entailed, in the number of distinct score levels
defined, in the latitude given scorers for assigning
intermediate or fractional score values,
and in other ways.

screening test A test that is used to make 
broad categorizations of examinees as a first step 
in selection decisions or diagnostic processes. 

security (of a test) See test security. 

selection A purpose for testing that results 
in the acceptance or rejection of applicants 
for a particular educational or employment 
opportunity. 

sensitivity In classification of disorders, the 
proportion of cases in which a disorder is 
detected when it is in fact present. 

Spearman-Brown formula A formula 
derived within classical test theory that projects
the reliability of a shortened or lengthened
test from the reliability of a test of
specified length. 
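
The standard form of the formula is r_k = k r / (1 + (k - 1) r), where r is the reliability of the existing test and k is the factor by which its length changes. A minimal Python sketch follows; the function and parameter names are illustrative, not taken from the glossary.

```python
def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability when test length is multiplied by length_factor.

    length_factor > 1 lengthens the test; 0 < length_factor < 1 shortens it.
    """
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# Hypothetical test with reliability 0.60, doubled and then halved in length.
print(spearman_brown(0.60, 2.0))
print(spearman_brown(0.60, 0.5))
```

The projection assumes the added or removed items are parallel to the existing ones, which is an idealization.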

specificity In classification of disorders, the 
proportion of cases for which a diagnosis of 
disorder is rejected when rejection is warranted.
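
Sensitivity and specificity, as defined in this glossary, are both simple proportions computed from classification counts. The sketch below is illustrative; the function names and the counts are hypothetical.

```python
def sensitivity(true_positives: int, false_negatives: int) -> float:
    """Proportion of cases with the disorder in which it is detected."""
    return true_positives / (true_positives + false_negatives)


def specificity(true_negatives: int, false_positives: int) -> float:
    """Proportion of cases without the disorder in which it is correctly rejected."""
    return true_negatives / (true_negatives + false_positives)


# Hypothetical screening results: 50 people with the disorder, 200 without.
print(sensitivity(true_positives=45, false_negatives=5))
print(specificity(true_negatives=180, false_positives=20))
```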

speededness A test characteristic, dictated
by the test’s time limits, that results in a test
taker's score being dependent on the rate at
which work is performed as well as the correctness
of the responses. The term is not
used to describe tests of speed. Speededness
is often an undesirable characteristic.

split-halves reliability coefficient An internal
consistency coefficient obtained by using
half the items on the test to yield one score
and the other half of the items to yield a second,
independent score. The correlation
between the scores on these two half-tests,
adjusted via the Spearman-Brown formula,
provides an estimate of the alternate-form
reliability of the total test.
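
The procedure in this entry can be sketched as follows. This is a minimal illustration under stated assumptions: the odd/even item split is one common choice (the glossary only says "half the items"), and the function names and data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def split_half_reliability(item_scores):
    """Split-halves coefficient from a list of per-examinee item-score lists.

    Items in even positions (0, 2, ...) form one half-test and items in
    odd positions form the other; this split is an assumption of the sketch.
    """
    half_a = [sum(row[0::2]) for row in item_scores]
    half_b = [sum(row[1::2]) for row in item_scores]
    r_halves = pearson(half_a, half_b)
    # Spearman-Brown adjustment for doubling the half-test length.
    return 2 * r_halves / (1 + r_halves)


# Hypothetical 0/1 item scores for five examinees on a six-item test.
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
]
print(split_half_reliability(scores))
```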

stability The extent to which scores on a test 
are essentially invariant over time. Stability is 
an aspect of reliability and is assessed by correlating
the test scores of a group of individuals
with scores on the same test, or an equated 
test, taken by the same group at a later time. 

standard error of measurement The standard
deviation of an individual’s observed
scores from repeated administrations of a test
(or parallel forms of a test) under identical
conditions. Because such data cannot generally
be collected, the standard error of measurement
is usually estimated from group data.
See error of measurement.
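
A common way to estimate the standard error of measurement from group data in classical test theory is SEM = SD × sqrt(1 − reliability), where SD is the standard deviation of observed scores in the group. The glossary does not state this formula, so the sketch below should be read as one conventional estimate; the function name and values are illustrative.

```python
from math import sqrt


def standard_error_of_measurement(score_sd: float, reliability: float) -> float:
    """Classical estimate of the SEM from a group SD and a reliability coefficient."""
    if not 0.0 <= reliability <= 1.0:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * sqrt(1.0 - reliability)


# Hypothetical scale with SD 15 and reliability 0.91: SEM is about 4.5 points.
print(standard_error_of_measurement(score_sd=15.0, reliability=0.91))
```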

standard score A type of derived score such
that the distribution of these scores for a 
specified population has convenient, known 
values for the mean and standard deviation. 
The term is sometimes used to signify a mean 
of 0.0 and a standard deviation of 1.0. See 
derived score. 
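
A brief sketch of the two senses described in this entry: a z-score with mean 0.0 and standard deviation 1.0, and a linear rescaling to some other convenient mean and standard deviation. The function names and the example values are assumptions made for illustration.

```python
def z_score(raw: float, mean: float, sd: float) -> float:
    """Standard score with mean 0.0 and standard deviation 1.0."""
    return (raw - mean) / sd


def rescaled_standard_score(raw, mean, sd, new_mean, new_sd):
    """Standard score on a scale with a chosen mean and standard deviation."""
    return new_mean + new_sd * z_score(raw, mean, sd)


# Hypothetical raw score of 65 in a population with mean 50 and SD 10.
print(z_score(65, 50, 10))
print(rescaled_standard_score(65, 50, 10, new_mean=100, new_sd=15))
```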

standardization 1. In test administration,
maintaining a constant testing environment
and conducting the test according to detailed
rules and specifications, so that testing conditions
are the same for all test takers. 2. In test
development, establishing scoring norms
based on the test performance of a representative
sample of individuals with which the test
is intended to be used. 3. In statistical analysis,
transforming a variable so that its standard
deviation is 1.0 for some specified
population or sample. See standard score.

standards-based assessment Assessments
intended to represent systematically described 
content and performance standards. 

stratified coefficient alpha A modification 
of coefficient alpha that renders it appropriate 
for a multi-factor test by defining the total
score as the composite of scores on single-fac- 

stratified sample See sample. 

systematic error A consistent score component
(often observed indirectly), not related
to the test performance. See bias. 

technical manual A publication prepared by 
test authors and publishers to provide technical
and psychometric information on a test.

test An evaluative device or procedure in which 
a sample of an examinee's behavior in a specified
domain is obtained and subsequently evaluated 
and scored using a standardized process. 

test developer The person(s) or agency 
responsible for the construction of a test and 
for the documentation regarding its technical 
quality for an intended purpose. 

test development The process through which 
a test is planned, constructed, evaluated, and 
modified, including consideration of content, 
format, administration, scoring, item properties,
scaling, and technical quality for its
intended purpose. 

test documents Publications such as test 
manuals, technical manuals, user’s guides, 
specimen sets, and directions for test administrators
and scorers that provide information for
evaluating the appropriateness and technical 
adequacy of a test for its intended purpose. 

test information function A mathematical
function relating each level of an ability or
latent trait, as defined under item response theory
(IRT), to the reciprocal of the corresponding
conditional measurement error variance.

test manual A publication prepared by test 
developers and publishers to provide information
on test administration, scoring, and
interpretation and to provide technical data 
on test characteristics. See user's guide. 

test modification Changes made in the content,
format, and/or administration procedure
of a test in order to accommodate test takers 
who are unable to take the original test under 
standard test conditions. 

test security Limiting access to the specific 
content of a test to those who need to know 
it for test development, test scoring, and test 
evaluation. In particular, test items on secure 
tests are not published; unauthorized copying 
is forbidden by any test taker or anyone otherwise
associated with the test. A secure test is
not for publication in any form, in any venue. 

test specifications A detailed description for 
a test, often called a test blueprint, that specifies
the number or proportion of items that
assess each content and process/skill area;
the format of items, responses, and scoring
rubrics and procedures; and the desired psychometric
properties of the items and test
such as the distribution of item difficulty 
and discrimination indices. 

test user The person(s) or agency responsible 
for the choice and administration of a test, 
for the interpretation of test scores produced 
in a given context, and for any decisions or 
actions that are based, in part, on test scores. 

test-retest reliability A reliability coefficient 
obtained by administering the same test a second
time to the same group after a time
interval and correlating the two sets of scores. 

timed tests A test administered to a test
taker who is allotted a strictly prescribed 
amount of time to respond to the test. 

top-down A method of selecting the best 
applicants according to some numerical scale 
of suitability. Often, “best” is taken to mean
“highest scoring on some test.” 

translational equivalence The degree to 
which the translated version of a test is equivalent
to the original test. Translational equivalence
is typically examined in terms of the
language used, the scores produced, and the 
constructs measured by the translated version 
and the original test. See back translation. 

true score In classical test theory, the average
of the scores that would be earned by an individual
on an unlimited number of perfectly
parallel forms of the same test. In item
response theory, the error-free value of test
taker proficiency, usually symbolized by θ.

unidimensional Having only one dimension,
or only one latent variable. 

user norms Descriptive statistics (including 
percentile ranks) for a sample of test takers 
that does not represent a well-defined reference
population, for example, all persons tested
during a certain period of time, or a set of
self-selected test takers. Also called program 
norms. See norms. 

user’s guide A publication prepared by the
test authors and publishers to provide information
on a test’s purpose, appropriate uses,
proper administration, scoring procedures,
normative data, interpretation of results, and
case studies. See test manual.

validation The process through which the 
validity of the proposed interpretation of test 


validity The degree to which accumulated 
evidence and theory support specific interpretations
of test scores entailed by proposed
uses of a test. 

validity argument An explicit scientific justification
of the degree to which accumulated
evidence and theory support the proposed 
interpretation(s) of test scores. 

validity generalization Applying validity 
evidence obtained in one or more situations 
to other similar situations on the basis of 
simultaneous estimation, meta-analysis, or 
synthetic validation arguments. 

variance components In testing, variances 
accruing from the separate constituent 
sources that are assumed to contribute to the 
overall variance of observed scores. Such variances,
estimated by methods of the analysis
of variance, often reflect situation, location, 
time, test form, rater, and related effects. 

vocational assessment A specialized type of 
psychological assessment designed to generate 
hypotheses and inferences about interests, 
work needs and values, career development, 
vocational maturity, and indecision. 

weighted scoring A method of scoring a test 
in which the number of points awarded for a 
correct (or diagnostically relevant) response is 
not the same for all items in the test. In some 
cases, the scoring formula awards more points 
for one response to an item than for another. 

much narrower area, and I'd say the
representations in our seven-point digital rights
management plan were the primary mechanism that we
dealt with that particular concern of the
publishing industry.

BY MR. HUDIS:

Q. Okay. The last sentence on that page,
page 15 of Exhibit 55, it says:

"With the extensive input
from consumers, authors,
publishers and leading
organizations, we have created a
model for Bookshare that can be
supported by a broad array of
interests."

What model is this passage talking
about?

MR. KAPLAN: Objection. Lacks
foundation.

THE WITNESS: The Bookshare operational
model.

BY MR. HUDIS:

Q. How would you describe the Bookshare
operational model?

A. A package of technologies and policies
and legal agreements and product features and -- I
mean, you know, it's a -- these things combined
create a service that delivers a value to people
with disabilities in a way that gets support from
these different stakeholders.

Q. Including the publishing industry?

A. Yes.

Q. Could we turn to page 16 of Exhibit 55.
Under copyright information, it says:

"Bookshare is an online
library that provides accessible
eBooks to people with print
disabilities. Bookshare meets the
requirements of the Chafee
Amendment which permits an
authorized entity like Benetech to
make books available to people
with print disabilities provided
that copies may not be reproduced
or distributed in a format other
than a specialized format
exclusively for use by blind or
other persons with disabilities.
Must bear a notice that any
further reproduction or
distribution in a format other
than a specialized format is an
infringement. Must include a
copyright notice identifying the
copyright owner and the date of
the original publication.
'Specialized formats' means
Braille, audio or digital text
which is exclusively intended for
use by blind or other persons with
disabilities."

All right. So I've read this passage,
Mr. Fruchterman.

A. Right.

Q. Does this accurately describe the
overall way that Benetech makes reading materials
available to its members?

MR. KAPLAN: Objection. Vague.
Misleading.

THE WITNESS: I think that these bullet
points that you just read recapitulate the
provisions of the Chafee Amendment, which is the
primary copyright exception that we use for making
copyright material to people with qualifying
disabilities inside the United States.

BY MR. HUDIS:

Q. If we could go to page 17 of Exhibit 55.

What is the purpose of this page on
Bookshare's web site?

MR. KAPLAN: Objection. Vague. Lacks
foundation.

THE WITNESS: This is part of our,
essentially, frequently asked questions, and it's
entitled "Digital Millennium Copyright Act."

And so as a -- and I'm not a lawyer, but
my understanding is is someone who provides access
to copyrighted material online, we are required to
have a DMCA agent to accept notices that there is
content on our web site that infringes the
copyright of others.

We frequently get DMCA notices from
authors or their agents or publishers saying, We
searched the web. This copyright work is on your
web site. Take it down.

And this is both explaining the DMCA
notice process at some level, as well as the, more
or less, if you don't know what the Chafee
Amendment is, you should look it up because we're
allowed to have it.

But I'm summarizing this in very direct
terms, because it's very rare for someone to issue
us a DMCA notice that results in us actually 
taking down the work because it's usually legally 
permitted under the copyright amendment. 

BY MR. HUDIS: 

Q. The Chafee Amendment to the copyright? 

A. The Chafee Amendment. Or often a 
license from the author's publisher who gave us 
the content, but the author and their agent 
weren't aware this was one of the nice things that 
their publisher did for their entire catalog of 
books, not just that author. 

Q. Mr. Fruchterman, could we turn to page 
18 of Exhibit 55. 

Is this text on page 18 Bookshare's 
digital rights plan — digital rights management 
plan? 

A. This is the current or, let's just say, 
last month's current — but I don't believe it's 
changed since last month — version of our 
seven-point digital rights management plan that we 
have discussed earlier. 

Q. And what was the purpose of Bookshare 
implementing this DRM plan? 

MR. KAPLAN: Objection. Vague. Lacks 
foundation.

THE WITNESS: I would say that the 
purpose of this was to represent to the 
intellectual property industry, especially 
publishers, that we were intending to follow the 
law when it came to use of these materials. So it 
was created for that original conversation we had 
with the publishing industry quite a number of 
years ago. 

BY MR. HUDIS: 

Q. And when you say "these materials," 
that's the copyrighted materials on the Bookshare 
web site? 

MR. KAPLAN: Objection. Misstates 

testimony. 

THE WITNESS: Yes. 

BY MR. HUDIS: 

Q. Could we turn to page 19. 

A. Mh-hmm. 

Q. What's the purpose of this sign-up page? 

That's page 19 of Exhibit 55. 

MR. KAPLAN: Objection. Vague. Lacks 
foundation. 

THE WITNESS: This is a screen shot that 
appears to be of the individual sign-up for 
Bookshare that is collecting data about a
potential user in order to start the process of 
becoming a Bookshare member. 

BY MR. HUDIS: 

Q. And at the bottom it says — it has a 
check box, and then you would sign your name or 
its equivalent. 

Do you see at the bottom? 

A. Yes. 

Q. And by doing so you're agreeing to the 

terms and conditions of the Bookshare web site. 

Do you see that? 

MR. KAPLAN: Objection. Is the — the 
question is whether or not he sees that check box? 

MR. HUDIS: Counsel, good. 

Q. Is the purpose of this check box to have 
the user acknowledge that he or she is agreeing to 
the terms and conditions of the Bookshare web 
site? 

MR. KAPLAN: Objection. Vague. Lacks 
foundation. 

MR. HUDIS: Thank you, Counsel.

THE WITNESS: Yes. I believe that that 
check box and the filling in of your name 
indicates that you're agreeing to the terms and 
conditions of our -- of our -- of our agreement,
of our Bookshare individual membership agreement. 

BY MR. HUDIS: 

Q. And if you could turn to page 20 of 
Exhibit 55. Are those the terms and conditions of 
the — of the Bookshare web site? 

A. It appears to be our standard Bookshare 
membership agreement of a recent date. 

MR. HUDIS: Counsel, same request. Can
we stipulate this is a business record of 
Benetech? 

MR. KAPLAN: Subject to your 
representation that this is — each page 
represents a complete Snagit screen shot of a 
particular web site or web page of the Benetech 
web site, I believe so. 

But can we go off the record for just a 

second? 

MR. HUDIS: Yes. I consent. We can go 
off the record. 

THE VIDEOGRAPHER: Okay. Going off the 
record at 1:43. 

(Discussion held off record.) 

THE VIDEOGRAPHER: Back on the record at 

1:43. 

MR. KAPLAN: So subject to Counsel's
representation regarding the contents of this 
exhibit, we stipulate to its authenticity as 
select web pages from the Benetech web site. 

MR. HUDIS: All right. Now, that's the 
authenticity. What about business record? That 
was what I was concerned about. You stipulated to 
the authenticity. We do have — I do — 

MR. KAPLAN: You want a stipulation that 
the statements in here are not hearsay for the 
purpose of — 

MR. HUDIS: For what they contain. 

MR. KAPLAN: I don't believe we can 
stipulate that — to that because, as far as I 
know, we don't represent Benetech. 

BY MR. HUDIS: 

Q. All right. So if you could — if, 

Mr. Fruchterman, you could put Exhibit 55 back in 
front of you. 

A. Yes. 

Q. All right. So the pages on Exhibit 55, 

I'm going to represent to you that they are Snagit 
screen shots of the Bookshare web site. 

So my question is are these pages items 
of data compilations made by Benetech? 

MR. HUDIS: No. I'm having problems
with --

MR. KAPLAN: Scrolling.

MR. HUDIS: -- what was put down as your
answer.

(Record read by the reporter
as follows:

ANSWER: As before, I would
change numbers that are based on
the date of this declaration.)

BY MR. HUDIS:

Q. So which numbers would you change?

MR. KAPLAN: Objection. Vague.

THE WITNESS: Yeah. Paragraph 1 -- 2 --
sorry, paragraph 2, I cite how many users, how
many books, what our monthly capacity is. I would
update those to current figures.

BY MR. HUDIS:

Q. So it would be more?

A. Yes.

MR. KAPLAN: Description.

THE WITNESS: Sorry.

That's it.

BY MR. HUDIS:

Q. Are paragraphs 4 through 12 of
Exhibit 60 still today an accurate description of
Bookshare's seven-point digital rights management
plan?

MR. KAPLAN: Objection. Vague.

THE WITNESS: Yes.

BY MR. HUDIS:

Q. If we could turn to paragraph 1, page 1,
of Exhibit 60. You say:

"Based upon my experience
with the Bookshare online library
for people with print
disabilities, I believe that the
risk of online piracy or
unauthorized copying and
distribution of works made fully
available to individuals" --
"individuals with print
disabilities through the
HathiTrust is minimal."

What was the basis for this statement
that you made in paragraph 1?

MR. KAPLAN: Objection. Confusing. The
document speaks for itself. Vague.

THE WITNESS: My declaration explains
why, at length.

BY MR. HUDIS:

Q. Why is there no discussion of the 
HathiTrust security measures in this declaration 
of Exhibit 60? 

MR. KAPLAN: Objection. Argumentative. 

Vague. 

And I will instruct the witness not to 
answer to the extent that it calls for privileged 
communications or information protected by Rule 26 
of the Federal Rules of Civil Procedure. 

BY MR. HUDIS: 

Q. Mr. Fruchterman, first of all, will you 
adhere to counsel's instructions? 

A. Yes. 

MR. KAPLAN: First — 

BY MR. HUDIS: 

Q. And can you — 

MR. KAPLAN: Yeah. Okay. 

BY MR. HUDIS: 

Q. And can you answer my question without
revealing the substance of attorney-client 
communications? 

A. No. 

Q. In making the statement "I believe that 

the risk of online piracy or unauthorized copying 
and distribution of works made fully available to
individuals with print disabilities through the
HathiTrust is minimal," did you review the
security measures on the HathiTrust web site?

MR. KAPLAN: Objection. Vague.

THE WITNESS: Not beyond previously
discussed.

BY MR. HUDIS:

Q. Mr. Fruchterman, do you recall what the
outcome was in the HathiTrust litigation?

MR. KAPLAN: Objection. Vague. Calls
for a legal conclusion. Lacks foundation.

THE WITNESS: I do.

MR. KAPLAN: I'm sorry. Scratch the
last objection.

THE WITNESS: I do.

BY MR. HUDIS:

Q. All right. What -- and what was -- what
is your understanding of the outcome of the
HathiTrust litigation?

A. That the motion for summary judgment by
the defendants was granted by the district court
judgment and upheld in an appellate court
decision.

Q. And did you -- did you review the


CONTAINS CONFIDENTIAL INFORMATION 
PLANET DEPOS | 888.433.3767 | WWW.PLANETDEPOS.COM

JA2451 




Case 1:14-cv-00857-TSC Document 60-30 Filed 12/21/15 Page 231 of 386
Conducted on September 8, 2015 


229 

district court's opinion after it was issued? 

A. I did. 

(Whereupon, Deposition Exhibit 61 was 
marked for identification.) 

BY MR. HUDIS: 

Q. Mr. Fruchterman, I'd like you to turn to 
page 4 of what's now been marked as Exhibit 61. 

It is the district court's opinion in the Authors 
Guild, Inc. versus HathiTrust, et al., reported at 
902 F. Supp. 2d 445 and the date of the decision is
October 10, 2012. 

MR. KAPLAN: Counsel, it's a Westlaw 

printout. 

MR. HUDIS: Yes. 

MR. KAPLAN: Including Westlaw's 
commentary and descriptions and additional 
material that was not contained in the original 
decision. 

MR. HUDIS: Noted. 

Q. Mr. Fruchterman, could you please turn 
to page 4 of the document. 

A. Yes. 

Q. And it says, under "Background," 

"Defendants"— are you with me? 

A. Yes. 

Q. All right.

"Defendants have entered into
agreements with Google Inc. that
allow Google to create digital
copies of works in the
universities' libraries in
exchange for which Google provides
digital copies to defendants, the
mass digitization product or MDP."

Was that your understanding of how the
HathiTrust library worked?

MR. KAPLAN: Objection. Vague.
Confusing.

THE WITNESS: Yes. Generally.

BY MR. HUDIS:

Q. All right. If we could turn to page 5
of Exhibit 61. At the top left-hand corner, it
says:

"After digitization, Google
retains a copy of the digital book
that is available through Google
Books, an online system through
which Google users can search the
content and view snippets of the
books. Google also provides a

away?

Q. Soon. 

A. Or are these the only two I need to have 
out now? 

Q. Those are the only two you have to have 
out now. 

A. Okay. I have those two documents in 

front of me. Exhibit 55 and 60. 

Q. Okay. So I would like to focus your 

attention on -- in the supplemental declaration,
Exhibit 60, to pages 2 and 3, where you talk about
the digital rights management plan. 

A. Yes. 

Q. Okay. And similarly, an explanation of 
the DRM plan on page 18 of Exhibit 55. And that's 
the Bookshare web site. 

A. Okay. 

Q. During your review of Public.Resource's 

web site, how did their web site compare with the 
Bookshare web site in terms of employing a digital 
rights management or DRM plan to protect the 
digital copies of standards posted on 
Public.Resource's web site from unauthorized 
copying? 

MR. KAPLAN: Objection. Vague. Calls 
for a legal conclusion. Confusing.

THE WITNESS: I didn't find a DRM plan 
in evidence on the Public.Resource.Org site. 

MR. HUDIS: I'd like to take a break for 
five minutes. 

THE VIDEOGRAPHER: Going off the record 

at 5:33. 

(Whereupon, a recess was taken.) 

THE VIDEOGRAPHER: Back on the record at 

5:39. 

BY MR. HUDIS: 

Q. Mr. Fruchterman, when you examined 
Public.Resource's web site, you noticed a number 
of standards that were hosted on that web site? 

A. Correct. 

MR. KAPLAN: Objection. Vague. Asked 

and answered. 

BY MR. HUDIS: 

Q. Did you notice any restrictions on the 

ability of an Internet user to copy any of the 
standards that you saw on Public.Resource's web 
site? 

MR. KAPLAN: Objection. Vague. 

THE WITNESS: No. 

BY MR. HUDIS:

Q. Did you notice any restrictions on the
ability of an Internet user to download any of the
standards hosted on the Public.Resource's web
site?

MR. KAPLAN: Objection. Vague.

THE WITNESS: No.

BY MR. HUDIS:

Q. Did you notice any restrictions on the
ability of an Internet user to print any of the
standards hosted on the Public.Resource web site?

MR. KAPLAN: Objection. Vague.

THE WITNESS: No.

MR. HUDIS: Thank you, Mr. Fruchterman.
That's all I have.

THE WITNESS: Okay. Thank you.

MR. KAPLAN: I have no questions at this
time.

THE WITNESS: Okay. Oh, that's right.
You get a chance, huh.

THE VIDEOGRAPHER: This marks the end of
the deposition of James Fruchterman. Going off
the record at 5:41.

(Whereupon, the deposition concluded
at 5:41 p.m.)
CERTIFICATE OF REPORTER
I, Kathleen A. Wilkins, Certified 
Shorthand Reporter licensed in the State of 
California, License No. 10068, hereby certify that 
the deponent was by me first duly sworn, and the 
foregoing testimony was reported by me and was 
thereafter transcribed with computer-aided 
transcription; that the foregoing is a full, 
complete, and true record of proceedings. 

I further certify that I am not of 
counsel or attorney for either or any of the 
parties in the foregoing proceeding and caption 
named or in any way interested in the outcome of 
the cause in said caption. 

The dismantling, unsealing, or unbinding 
of the original transcript will render the 
reporter's certificates null and void. 

In witness whereof, I have hereunto set 
my hand this day: 

_ Reading and Signing was requested.

_ Reading and Signing was waived.

_X_ Reading and Signing was not requested.


KATHLEEN A. WILKINS 

CSR 10068, RPR-RMR-CRR-CCRR-CLR 
Subject: copyright infringement

for Educational and Psychological Testing. Without permission, the volume is posted at the following url:

https://law.resource.org/pub/us/cfr/ibr/001/aera.standards.1999.pdf 

Please remove this unlawful posting immediately. 

Cordially, 

John Neikirk 

Director of Publications 

American Educational Research Association 

1430 K Street, NW, Suite 1200 

Washington, DC 20005 

202.238.3238 

jneikirk@aera.net
PUBLIC.RESOURCE.ORG ~ A Nonprofit Corporation


Open Source America’s Operating System 

“It’s Not Just A Good Idea—It’s The Law!” 



December 19, 2013 


John Neikirk 

Director of Publications 

American Educational Research Association 

1430 K Street, NW, Suite 1200 

Washington, DC 20005

Dear Mr. Neikirk:

I am in receipt of your communication of December 16 regarding the publication of the
AERA publication, “Standards for Educational and Psychological Testing” (1999) at
https://law.resource.org/pub/us/cfr/ibr/001/aera.standards.1999.pdf. We are 
responsible for uploading this document. In addition, you will find this document at 
https://archive.org/details/gov.law.aera.standards.1999. 

The 1999 Edition of “Standards for Educational and Psychological Testing” was
Incorporated by Reference by the Department of Education, Office of Postsecondary
Education, at 34 CFR 668.148(a)(1)(iv). Incorporation by reference is not a casual affair
and requires a carefully followed procedure by the governmental agency and the 
explicit approval of the Director of the Office of the Federal Register. 

As this standard has been incorporated into law, the standard contained in this 
document is the law of the United States, and people in the United States are 
compelled to obey it. Long-standing precedent of the United States Supreme Court 
holds that copyright claims cannot prevent citizens from reading and speaking the law. 
See Wheaton v. Peters, 33 U.S. 591 (1834); Banks v. Manchester, 128 U.S. 244 (1888). 

While the standards drafted by the American Educational Research Association were
entitled to copyright protection when issued, once they were incorporated into 
regulations these standards became the law, and thus have entered the public domain. 
Chief Judge Edith H. Jones of the 5th Circuit expressed this principle clearly in her 
opinion in Veeck v. Southern Building Code Congress, which concerned a model 
building code incorporated in the law of two Texas towns: 

"The issue in this en banc case is the extent to which a private organization may 
assert copyright protection for its model codes, after the models have been 
adopted by a legislative body and become “the law.” Specifically, may a
code-writing organization prevent a website operator from posting the text of a
model code where the code is identified simply as the building code of a city 
that enacted the model code as law? Our short answer is that as law, the model 
codes enter the public domain and are not subject to the copyright holder’s 
exclusive prerogatives. As model codes, however, the organization’s works 
retain their protected status." 293 F.3d 791 (5th Cir. 2002) (en banc). 


As you can see by looking at the document in question, a cover sheet has been 
prepended clearly spelling out the section of the Code of Federal Regulations that has 
incorporated by reference this document into law. Please note that we were careful to 
only publish the specific document incorporated by law. As the 1999 Edition is the one 
required by law and as it has been duly incorporated into law, we respectfully decline 
to remove this document and respectfully decline to request permission. 


Sincerely yours, 



Carl Malamud 


Digitally signed by Carl Malamud
DN: cn=Carl Malamud, o=Public.Resource.Org, ou, email=carl@media.org, c=US
Date: 2013.12.19 10:34:04 -08'00'
PUBLIC.RESOURCE.ORG ~ A Nonprofit Corporation


Public Works for a Better Government 


American Educational Research Association, Inc. et al. v.
Public.Resource.Org, Inc., No. 1:14-cv-00857

This memorandum is in reference to the lawsuit named above, which concerns the 
document entitled “Standards for Educational and Psychological Testing" which was 
duly incorporated by the Office of the Federal Register into the Code of Federal 
Regulations and specifically in response to the stated intention to file a preliminary 
injunction motion. 

Public.Resource.Org believes firmly that because the document in question has been 
explicitly incorporated into federal law, it has the right to post it on its website, and 
that it will prevail in this case. Public Resource also believes that this case deserves the 
court’s fullest consideration, without a rush to reach an interim ruling in the absence 
of a full record. 

In order to focus this case on developing an appropriate record for a decision on the 
merits, Public.Resource.Org has voluntarily removed the document in question from 
the websites under its control and has removed the document from all public access 
on the Internet Archive. 

Until the conclusion of trial on the merits in this case, Public.Resource.Org will keep 
the document in question off of the websites under its control and will not disseminate 
the document, in whole or in part, including any revisions, and will maintain the status 
on the Internet Archive to prevent any public access to the document from the 
Archive’s websites. 

Public.Resource.Org believes that this action obviates any need for plaintiffs to rush 
the court to a judgment on a partial record of their own selection without a full 
opportunity for all parties to develop the facts and issues of the case for trial. 




Carl Malamud, President and Founder 


Date 
UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF COLUMBIA

AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN
PSYCHOLOGICAL ASSOCIATION, INC., and NATIONAL COUNCIL ON
MEASUREMENT IN EDUCATION, INC.,

Plaintiffs,

v.

PUBLIC.RESOURCE.ORG, INC.,

Defendant.

Civil Action No. 1:14-cv-00857-CRC

DECLARATION OF MARIANNE ERNESTO IN SUPPORT OF PLAINTIFFS’ MOTION FOR
SUMMARY JUDGMENT AND ENTRY OF A PERMANENT INJUNCTION

I, MARIANNE ERNESTO, declare: 

1. I am the Director, Testing and Assessment, at the American Psychological 
Association, Inc. (“APA”). I have been employed with the APA since May 2001. I submit this 
Declaration in support of the motion of the American Educational Research Association, Inc. 
(“AERA”), the APA, and the National Council on Measurement in Education, Inc. (“NCME”) 
(collectively, “Plaintiffs” or the “Sponsoring Organizations”) for a summary judgment and the 
entry of a permanent injunction. 

2. In my role as Director, Testing and Assessment, I serve as APA’s primary 
authority on all matters that relate to testing and assessment. This subject matter includes 
educational testing, clinical assessment, forensic testing and employment testing. I advocate on
behalf of APA in matters involving federal or state legislative, regulatory or other policy issues
concerning testing and assessment. I coordinate APA’s involvement in testing issues in matters
such as governance, executive boards, and managerial bodies. I also manage APA’s responses to
internal, public, member and media inquiries regarding testing issues in a manner that is
consistent with the Standards for Educational and Psychological Testing (the “Standards”). I
advise, counsel and oversee the activities of the APA’s Science Directorate (and in particular its
Office of Testing and Assessment) on policy and governance issues related to testing and 
assessments. I further serve as staff liaison to the APA’s Committee on Psychological Tests and 
Assessment (“CPTA”). Since 2001, I have served as APA’s primary contact for information 
concerning the availability and interpretation of the Standards published in 1999, and more 
recently I have done so regarding the updated Standards published in 2014. 

3. APA is a District of Columbia not-for-profit corporation. 

4. APA is the largest scientific and professional organization representing 
psychology in the United States. APA is the world’s largest association of psychologists and 
counts a vast number of researchers, educators, clinicians, consultants and students among its 
members. APA’s mission is to advance the creation, communication and application of 
psychological knowledge to benefit society and improve people’s lives. 

5. In 1954, APA prepared and published the “Technical Recommendations for 
Psychological Tests and Diagnostic Techniques.” It is my understanding that in 1955 AERA and 
NCME prepared and published a companion document entitled, “Technical Recommendations 
for Achievement Tests.” 

6. Subsequently, a joint committee of the three organizations modified, revised and 
consolidated the two documents into the first Joint Standards. Beginning with the 1966 revision, 
the three organizations (AERA, APA and NCME) collaborated in developing the “Joint 
Standards” (or simply, the “Standards”). Each subsequent revision of the Standards has been 
careful to cite the previous Standards and note that it is a revision and update of that document. 

7. Beginning in the mid-1950s, AERA, APA, and NCME formed and periodically 
reconstituted a committee of experts in psychological and educational assessment, charged with 
the initial development of the Technical Recommendations and then each subsequent revision of
the (renamed) Standards. These committees were formed by the Plaintiffs’ Presidents (or their 
designees), who would meet and jointly agree on the membership. Often a chair or co-chairs of 
these committees were selected by joint agreement. Beginning with the 1966 version of the 
Standards, this committee became referred to as the “Joint Committee.” 

8. Financial and operational oversight for the Standards’ revisions, promotion, 
distribution, and for the sale of the 1999 and 2014 Standards has been undertaken by a 
periodically reconstituted Management Committee, comprised of designees of the three 
Sponsoring Organizations. 

9. All members of the Joint Committee(s) and the Management Committee(s) are 
unpaid volunteers. The expenses associated with the ongoing development and publication of 
the Standards include travel and lodging expenses (for the Joint Committee and Management 
Committee members), support staff time, printing and shipment of bound volumes, and 
advertising costs. 

10. Many different fields of endeavor rely on assessments. The Sponsoring 
Organizations have ensured that the range of these fields of endeavor is represented in the Joint 
Committee’s membership - e.g., admissions, achievement, clinical counseling, educational, 
licensing-credentialing, employment, policy, and program evaluation. Similarly, the Joint 
Committee’s members, who are unpaid volunteers, represent expertise across major functional 
assessment areas - e.g., validity, equating, reliability, test development, scoring, reporting, 
interpretation, and large scale interpolation. 

11. From the time of their initial creation to the present, the preparation of and 
periodic revisions to the Standards entail intensive labor and considerable cross-disciplinary 
expertise. Each time the Standards are revised, the Sponsoring Organizations select and arrange
for meetings of the leading authorities in psychological and educational assessments (known as 
the Joint Committee). During these meetings, certain Standards are combined, pared down, 
and/or augmented, others are deleted altogether, and some are created as whole new individual 
Standards. The 1999 version of the Standards is nearly 200 pages and took more than five years 
to complete. 

12. The Standards, however, are not simply intended for members of the Sponsoring 
Organizations, AERA, APA, and NCME. The intended audience of the Standards is broad and 
cuts across audiences with varying backgrounds and different training. For example, the 
Standards also are intended to guide test developers, sponsors, publishers, and users by providing 
criteria for the evaluation of tests, testing practices, and the effects of test use. Test user 
standards refer to those standards that help test users decide how to choose certain tests, interpret 
scores, or make decisions based on tests results. Test users include clinical or industrial 
psychologists, research directors, school psychologists, counselors, employment supervisors, 
teachers, and various administrators who select or interpret tests for their organizations. There is 
no mechanism, however, to enforce compliance with the Standards on the part of the test 
developer or test user. The Standards, moreover, do not attempt to provide psychometric 
answers to policy or legal questions. 

13. The Standards apply broadly to a wide range of standardized instruments and 
procedures that sample an individual’s behavior, including tests, assessments, inventories, scales, 
and other testing vehicles. The Standards apply equally to standardized multiple-choice tests, 
performance assessments (including tests comprised of only open-ended essays), and hands-on 
assessments or simulations. The main exceptions are that the Standards do not apply to 
unstandardized questionnaires (e.g., unstructured behavioral checklists or observational forms),
teacher-made tests, and subjective decision processes (e.g., a teacher’s evaluation of students’ 
classroom participation over the course of a semester). 

14. The Standards have been used to develop testing guidelines for such activities as 
college admissions, personnel selection, test translations, test user qualifications, and computer- 
based testing. The Standards also have been widely cited to address technical, professional, and 
operational norms for all forms of assessments that are professionally developed and used in a 
variety of settings. The Standards additionally provide a valuable public service to state and 
federal governments as they voluntarily choose to use them. For instance, each testing company, 
when submitting proposals for testing administration, instead of relying on a patchwork of local, 
or even individual and proprietary, testing design and implementation criteria, may rely instead 
on the Sponsoring Organizations’ Standards to afford the best guidance for testing and 
assessment practices. 

15. The Standards were not created or updated to serve as a legally binding document, 
in response to an expressed governmental or regulatory need, nor in response to any legislative 
action or judicial decision. However, the Standards have been cited in judicial decisions related 
to the proper use and evidence for assessment, as well as by state and federal legislators. These 
citations in judicial decisions and during legislative deliberations occurred without any lobbying 
by the Plaintiffs. 

16. During the discovery phase of this litigation, APA located in its archives 
correspondence relating to APA’s support for proposed legislation sought to be introduced in 
2001 by Senator Paul Wellstone (D-MN) on Fairness and Accuracy in High Stakes Educational 
Decisions for Students - a suggested amendment to the Elementary and Secondary Education Act 
(“No Child Left Behind Act”) 147 Cong. Rec. S. 4,644 (daily ed. May 9, 2001).

17. Accompanying this Declaration as Exhibit NN is a true copy of a signed 
correspondence between Ellen Garrison Ph.D. and Patricia Kobor of APA and Ms. Jill 
Morningstar, Legislative Assistant to U.S. Senator Paul Wellstone dated April 7, 2000, marked
as Exhibit 1109 during my deposition. 

18. Accompanying this Declaration as Exhibit OO is a true copy of an unsigned 
correspondence between Ellen Garrison Ph.D. and Patricia Kobor of APA and Ms. Jill 
Morningstar, Legislative Assistant to U.S. Senator Paul Wellstone dated April 7, 2000, marked
as Exhibit 1110 during my deposition. 

19. Accompanying this Declaration as Exhibit PP is a true copy of a signed 
correspondence between Patricia Kobor and Ellen Garrison, Ph.D. of APA and Ms. Jill 
Morningstar, Legislative Assistant to U.S. Senator Paul Wellstone dated April 13, 2000, marked
as Exhibit 1111 during my deposition. 

20. Accompanying this Declaration as Exhibit QQ is a true copy of an unsigned 
correspondence between Raymond D. Fowler, Ph.D. of APA and an unnamed Senator dated 
May 7, 2001, marked as Exhibit 1114 during my deposition. 

21. Accompanying this Declaration as Exhibit RR is a true copy of an unsigned 
correspondence between L. Michael Honaker, Ph.D. of APA and an unnamed Senator dated 
March 6, 2001, marked as Exhibit 1115 during my deposition. 

22. Accompanying this Declaration as Exhibit SS is a true copy of a document 
containing “Highlights of APA’s Involvement in Educational Testing Provisions of the ‘No 
Child Left Behind Act’” that also contains an unsigned correspondence to an unnamed Senator 
dated May 7, 2001, marked as Exhibit 1116 during my deposition. 
23. As noted above, many of these letters are unsigned and are not printed on APA
letterhead. Therefore, in accordance with APA practices and protocols, it is likely that the 
unsigned letters (not printed on letterhead) were internal discussion drafts that were never sent. 

24. Regarding the signed letters that were printed on APA letterhead, they relate to 
Senator Wellstone’s proposed legislation that tests and assessments administered by the states 
are of high quality and used appropriately for the benefit of test administrators and test takers. 
These are goals that are consistent with APA policy as then reflected in the 1999 Standards. 
Even though Senator Wellstone’s amendments sought, in part, to mandate states’ compliance 
with the Standards, none of the Sponsoring Organizations actively advocated for this - and in 
any event Senator Wellstone’s proposed amendment including this language was never enacted 
into law. Accompanying this Declaration as Exhibit TT is a true copy of 20 U.S.C. § 6301, 
which is the current version of the legislation Senator Wellstone sought to amend. 

25. APA’s search of its records did not disclose any further communications with 
Congress relating to the Standards and, to the best of APA’s knowledge, it has not engaged in 
communications with Congress regarding citation of the Standards in legislation since 2001. 

26. APA has not solicited any government agency to incorporate the Standards into 
the Code of Federal Regulations or other rules of Federal or State agencies. 

27. Rather, in the policymaking arena, APA believes the Standards should be treated 
as guidelines informing the enactment of legislation and regulations consistent with best 
practices in the development and use of tests - to insure that they are valid, reliable and fair. 

28. Plaintiffs promote and sell copies of the Standards via referrals to the AERA 
website, at annual meetings, in public offerings to students, and to educational institution faculty. 
Advertisements promoting the Standards have appeared in meeting brochures, in scholarly 
journals, and in the hallways at professional meetings. Accompanying this Declaration as
Exhibit UU is a true copy of an advertisement for the 1999 Standards that appeared in the 
December 1999 issue of APA’s Journal of Educational Psychology. 

29. Distribution of the Standards is closely monitored by the Sponsoring 
Organizations. AERA, the designated publisher of the Standards, sometimes does provide 
promotional complimentary print copies to students or professors. Except for these few
complimentary print copies, however, the Standards are not given away for free; and certainly
they are not made available to the public by any of the three organizations for anyone to copy 
free of charge. 

30. To date, the Sponsoring Organizations have never posted, or authorized the 
posting of, a digitized copy of the 1999 Standards on any publicly accessible website. 

31. The Sponsoring Organizations do not keep any of the revenues generated from the 
sales of the Standards. Rather, the income from these sales is used by the Sponsoring 
Organizations to offset their development and production costs and to generate funds for 
subsequent revisions. This allows the Sponsoring Organizations to develop up-to-date, high 
quality Standards that otherwise would not be developed due to the time and effort that goes into 
producing them. 

32. Without receiving at least some moderate income from the sales of the Standards 
to offset their production costs and to allow for further revisions, it is very likely that the 
Sponsoring Organizations would no longer undertake to periodically update them, and it is 
unknown who else would. 

33. Due to the relatively minor portion of the membership of APA who devote their
careers to testing and assessment, it is highly unlikely that the members of APA will vote for a 
dues increase to fund future Standards revision efforts if Public Resource successfully defends
this case and is allowed to post the Standards online for the public to download or print for free. 
As a result, the Sponsoring Organizations would likely abandon their practice of periodically 
updating the Standards. 

34. The Joint Committee that authored the 1999 Standards comprised 16 members. 
Except for Manfred Meier (who could not be located, nor could his heirs), work made-for-hire 
letters were signed by 13 Joint Committee Members, and posthumous assignments were signed 
by the heirs of 2 deceased Joint Committee Members, vesting ownership of the copyright to the 
1999 Standards in the Sponsoring Organizations. Accompanying this Declaration as Exhibits 
VV-HHH are the 13 work made-for-hire letters signed by Eva Baker, Lloyd Bond, Daniel Goh, 
Bert Green, Edward Haertel, Jo-Ida Hansen, Suzanne Lane, Sharon Johnson-Lewis, Joseph 
Matarazzo, Pamela Moss, Esteban Olmedo, Diana Pullin, and Paul Sackett, marked as Exhibits 
1065, 1069, 1071, 1072, 1075, 1078, 1082, 1085, 1086, 1089, 1090, 1091, and 1094 during my 
deposition. Accompanying this Declaration as Exhibits III and JJJ are the posthumous 
assignments signed by the heirs of Leonard S. Feldt and Charlie Spielberger, marked as exhibits 
1070 and 1097 during my deposition. 

35. Public Resource posted Plaintiffs’ 1999 Standards to its website and the Internet 
Archive website without the permission or authorization of any of the Sponsoring Organizations. 

36. Past harm from Public Resource’s infringing activities includes misuse of 
Plaintiffs’ intellectual property without permission. 

37. Should Public Resource’s infringement be allowed to continue, the harm to the 
Sponsoring Organizations, and public at large who rely on the preparation and administration of 
valid, fair and reliable tests, includes: (i) uncontrolled publication of the 1999 Standards without
any notice that those guidelines have been replaced by the 2014 Standards; (ii) future
unquantifiable loss of revenue from sales of authorized copies of the 1999 Standards (with
proper notice that they are no longer the current version) and the 2014 Standards; and (iii) lack of
funding for future revisions of the 2014 Standards and beyond.

Case 1:14-cv-00857-TSC Document 60-49 Filed 12/21/15 Page 10 of 10
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 172 of 517


Dated: December [illegible], 2015


Marianne Ernesto 
American Psychological Association


April 21, 2014 

Dear Member of the Joint Committee to Revise the 1985 Standards for Educational and Psychological
Testing : 

In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled "Standards for Educational and Psychological Testing" (the "Standards"). Since its
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic
format, and any and all other formats then known, now known or hereafter to become known. We are
confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers. 
If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 





AMERICAN PSYCHOLOGICAL ASSOCIATION 




NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 

Date: 4/24/14 _ 
April 21, 2014

Dear Member of the Joint Committee to Revise the 1985 Standards for Educational and Psychological
Testing: 


In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled "Standards for Educational and Psychological Testing" (the "Standards"). Since its
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic
format, and any and all other formats then known, now known or hereafter to become known. We are
confirming in this letter that you accepted this assignment subject to the following terms and
conditions:

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning 
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American 
Psychological Association, at mernesto@apa.org. 

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 




AMERICAN PSYCHOLOGICAL ASSOCIATION 



NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 



Date: /D 30 1*1

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological 
Testing: 

In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since its 
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic 
format any and all other formats then known, now known or hereafter to become known. We are 
confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 

AMERICAN PSYCHOLOGICAL ASSOCIATION 


NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 




Accepted and Agreed: 


Date: 4 - g ~ 2.4

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological
Testing:

In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"),
(collectively, the "Publishers"), commissioned you and other leaders in the educational research,
psychology and educational testing fields to contribute information, materials and acumen to the
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since its
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic
format any and all other formats then known, now known or hereafter to become known. We are
confirming in this letter that you accepted this assignment subject to the following terms and
conditions:

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by
Publishers. You were reimbursed for reasonable expenses in connection with your work on the
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name
appeared in the Preface of the Standards, showing that you were one of its contributors, and you
received a free copy of the Standards upon its publication.

2. You acknowledge that the Standards, and all contributions that you made toward completion
and publication of the Standards, was and still is considered a "work made for hire" within the meaning
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and
interest in and to the Standards.

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any
and all formats now known or hereafter to become known, including but not limited to print, recorded
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services.

4. You have granted the Publishers the right to use your name in the Standards, in advertising and
promotion related to the Standards, and in any and all ancillary products related to the Standards
regardless of the formats in which such use occurs.

5. Your contributions to the Standards were wholly original material not published elsewhere
(except for material in the public domain or used with the permission of the owner), did not and does
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of
privacy or publicity, or infringement of any other kind, of any third party.

6. It is specifically understood and intended that you are an independent contractor, and nothing
herein is intended or shall be deemed to make you an employee of any of the Publishers.

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological 
Testing: 

In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since its
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic
format any and all other formats then known, now known or hereafter to become known. We are
confirming in this letter that you accepted this assignment subject to the following terms and
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any
and all formats now known or hereafter to become known, including but not limited to print, recorded
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services.

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 

AMERICAN PSYCHOLOGICAL ASSOCIATION 



NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 


April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological
Testing: 


In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since its 
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic 
format any and all other formats then known, now known or hereafter to become known. We are 
confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication.

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning 
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 



AMERICAN PSYCHOLOGICAL ASSOCIATION




NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological 
Testing: 


In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research,
psychology and educational testing fields to contribute information, materials and acumen to the
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since its
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic
format any and all other formats then known, now known or hereafter to become known. We are
confirming in this letter that you accepted this assignment subject to the following terms and
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by
Publishers. You were reimbursed for reasonable expenses in connection with your work on the
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion
and publication of the Standards, was and still is considered a "work made for hire" within the meaning
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services.

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 


AMERICAN PSYCHOLOGICAL ASSOCIATION 

NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 



Accepted and Agreed:

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 
Standards for Educational and Psychological Testing:

In 1998 and 1999, we, the American Educational Research 
Association (“AERA”), the American Psychological Association 
(“APA”) and the National Council on Measurement in 
Education (“NCME”), (collectively, the “Publishers”), 
commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute 
information, materials and acumen to the collective work 
entitled Standards for Educational and Psychological Testing” 
(the “Standards”). Since its publication in 1999, the Publishers 
had and still have the right to use the Standards in print, 
electronic format any and all other formats then known, now 
known or hereafter to become known. We are confirming in this 
letter that you accepted this assignment subject to the following 
terms and conditions: 

1. You delivered manuscript(s) or review(s) of
manuscript(s) within the time period established by Publishers.
You were reimbursed for reasonable expenses in connection 
with your work on the Standards, if approved by the Publishers 
in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were 
one of its contributors, and you received a free copy of the 
Standards upon its publication.

2. You acknowledge that the Standards, and all
contributions that you made toward completion and publication 
of the Standards, was and still is considered a “work made for 
hire” within the meaning of the United States copyright laws, 
and that the Publishers own all right, title and interest in and to 
the copyright in the Standards. To the extent that the Standards 
were not or are not deemed to be a “work made for hire” you 
hereby assign nunc pro tunc (now for then) to the Publishers all 
right, title and interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards 
for any and all uses and products, in any and all formats now 
known or hereafter to become known, including but not limited 
to print, recorded on hard storage media (e.g., CDs, DVDs, etc.), 
the Internet and online services. 

4. You have granted the Publishers the right to use your 
name in the Standards, in advertising and promotion related to 
the Standards, and in any and all ancillary products related to the 
Standards regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original 
material not published elsewhere (except for material in the 
public domain or used with the permission of the owner), did not 
and does not infringe any copyright, and did not and does not 
constitute a defamation, or invasion of the right of privacy or 
publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an 
independent contractor, and nothing herein is intended or shall 
be deemed to make you an employee of any of the Publishers. 

If the foregoing accurately sets forth our understanding, please 
sign and date, then scan and return this letter via e-mail 
attachment to Marianne Ernesto, Director, Testing and 
Assessment, American Psychological Association, at

Sincerely,



AMERICAN EDUCATIONAL RESEARCH 
ASSOCIATION 

AMERICAN PSYCHOLOGICAL ASSOCIATION 








NATIONAL COUNCIL ON MEASUREMENT IN 
EDUCATION 


Accepted and Agreed: 


Date:

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological 
Testing: 


In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since its 
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic 
format any and all other formats then known, now known or hereafter to become known. We are 
confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion
and publication of the Standards, was and still is considered a "work made for hire" within the meaning 
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 





AMERICAN PSYCHOLOGICAL ASSOCIATION 




NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 




Date: April 28, 2014

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological 
Testing: 


In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since
its publication in 1999, the Publishers had and still have the right to use the Standards in print,
electronic format any and all other formats then known, now known or hereafter to become known. We
are confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning 
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers. 


JA2507 


AERA_APA_NCME_0031461 




Case l:14-cv-00857-TSC DoajmenPSo-67 Filed 12/21/15 Page 3 of 3 
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 204 of 517 


If the foregoing accurately sets forth our understanding, please sign and date, then scan and return 
this letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American 
Psychological Association, at mernesto@apa.org.

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 




AMERICAN PSYCHOLOGICAL ASSOCIATION 




NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION

EXHIBIT FFF

April 21, 2014

Dear Member of the Joint Committee to Revised the 1985 Standards for Educational and Psychological
Testing: 

In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"),
(collectively, the "Publishers"), commissioned you and other leaders in the educational research,
psychology and educational testing fields to contribute information, materials and acumen to the
collective work entitled Standards for Educational and Psychological Testing" (the "Standards"). Since
its publication in 1999, the Publishers had and still have the right to use the Standards in print,
electronic format any and all other formats then known, now known or hereafter to become known. We
are confirming in this letter that you accepted this assignment subject to the following terms and
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and
interest in and to the Standards.

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services.

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing
herein is intended or shall be deemed to make you an employee of any of the Publishers.

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return
this letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American
Psychological Association, at mernesto@apa.org.

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 




AMERICAN PSYCHOLOGICAL ASSOCIATION 



NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 



Date: 

EXHIBIT GGG 

Case No. 1:14-cv-00857-TSC-DAR 

April 21, 2014 

Dear Member of the Joint Committee to Revise the 1985 Standards for Educational and Psychological 
Testing: 

In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled "Standards for Educational and Psychological Testing" (the "Standards"). Since its 
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic 
format any and all other formats then known, now known or hereafter to become known. We are 
confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning 
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers. 

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this 
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American 
Psychological Association, at mernesto@apa.org. 

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 


AMERICAN PSYCHOLOGICAL ASSOCIATION 




NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 


Accepted and Agreed: 

April 21, 2014 

Dear Member of the Joint Committee to Revise the 1985 Standards for Educational and Psychological 
Testing: 


In 1998 and 1999, we, the American Educational Research Association ("AERA"), the American 
Psychological Association ("APA") and the National Council on Measurement in Education ("NCME"), 
(collectively, the "Publishers"), commissioned you and other leaders in the educational research, 
psychology and educational testing fields to contribute information, materials and acumen to the 
collective work entitled "Standards for Educational and Psychological Testing" (the "Standards"). Since its 
publication in 1999, the Publishers had and still have the right to use the Standards in print, electronic 
format any and all other formats then known, now known or hereafter to become known. We are 
confirming in this letter that you accepted this assignment subject to the following terms and 
conditions: 

1. You delivered manuscript(s) or review(s) of manuscript(s) within the time period established by 
Publishers. You were reimbursed for reasonable expenses in connection with your work on the 
Standards, if approved by the Publishers in advance, upon your submission of receipts. Your name 
appeared in the Preface of the Standards, showing that you were one of its contributors, and you 
received a free copy of the Standards upon its publication. 

2. You acknowledge that the Standards, and all contributions that you made toward completion 
and publication of the Standards, was and still is considered a "work made for hire" within the meaning 
of the United States copyright laws, and that the Publishers own all right, title and interest in and to the 
copyright in the Standards. To the extent that the Standards were not or are not deemed to be a "work 
made for hire" you hereby assign nunc pro tunc (now for then) to the Publishers all right, title and 
interest in and to the Standards. 

3. Accordingly, the Publishers may also use the Standards for any and all uses and products, in any 
and all formats now known or hereafter to become known, including but not limited to print, recorded 
on hard storage media (e.g., CDs, DVDs, etc.), the Internet and online services. 

4. You have granted the Publishers the right to use your name in the Standards, in advertising and 
promotion related to the Standards, and in any and all ancillary products related to the Standards 
regardless of the formats in which such use occurs. 

5. Your contributions to the Standards were wholly original material not published elsewhere 
(except for material in the public domain or used with the permission of the owner), did not and does 
not infringe any copyright, and did not and does not constitute a defamation, or invasion of the right of 
privacy or publicity, or infringement of any other kind, of any third party. 

6. It is specifically understood and intended that you are an independent contractor, and nothing 
herein is intended or shall be deemed to make you an employee of any of the Publishers. 

If the foregoing accurately sets forth our understanding, please sign and date, then scan and return this 
letter via e-mail attachment to Marianne Ernesto, Director, Testing and Assessment, American 
Psychological Association, at mernesto@apa.org. 

Sincerely, 



AMERICAN EDUCATIONAL RESEARCH ASSOCIATION 

AMERICAN PSYCHOLOGICAL ASSOCIATION 

NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION 

Accepted and Agreed: 



D...: 

UNITED STATES DISTRICT COURT 
FOR THE DISTRICT OF COLUMBIA 

AMERICAN EDUCATIONAL RESEARCH        )
ASSOCIATION, INC., AMERICAN          )
PSYCHOLOGICAL ASSOCIATION, INC.,     )
and NATIONAL COUNCIL ON              )   Civil Action No. 1:14-cv-00857-TSC-DAR
MEASUREMENT IN EDUCATION, INC.,      )
                                     )   DECLARATION OF LAURESS
               Plaintiffs,           )   L. WISE IN SUPPORT OF
                                     )   PLAINTIFFS’ MOTION FOR
          v.                         )   SUMMARY JUDGMENT AND ENTRY
                                     )   OF A PERMANENT INJUNCTION
PUBLIC.RESOURCE.ORG, INC.,           )
                                     )
               Defendant.            )
_____________________________________)

I, LAURESS L. WISE, declare: 

1. I am the Immediate Past President of the National Council on Measurement in 
Education, Inc. (“NCME”). I have been a member of this organization for approximately 30 
years. I previously was the President of NCME from April 2014 through April 2015, and Vice 
President of this organization from April 2013 through April 2014. I submit this Declaration in 
support of the motion of the American Educational Research Association, Inc. (“AERA”), the 
American Psychological Association, Inc. (“APA”), and the NCME (collectively, “Plaintiffs” or 
“Sponsoring Organizations”) for summary judgment and the entry of a permanent injunction. 

2. I also am a principal scientist with the Human Resources Research Organization 
(“HumRRO”), spending full time on research and evaluation projects relating to educational 
measurement. I previously served as HumRRO CEO for 13 years, combining management and 
research activities and, before that, directed research and development for the Armed Services 
Vocational Aptitude Battery for the Department of Defense. Before that I spent 16 years as a 
researcher for the American Institutes for Research, rising to the position of Director of 
Research. I am also a member of both AERA and APA. 

3. NCME is a District of Columbia not-for-profit corporation. 

4. NCME is a professional organization for individuals involved in assessment, 
evaluation, testing, and other aspects of educational measurement. NCME’s members are 
involved in the construction and use of standardized tests; new forms of assessment, including 
performance-based assessment; program design; and program evaluation. 

5. In 1955, AERA and NCME prepared and published a companion document to 
APA’s “Technical Recommendations for Psychological Tests and Diagnostic Techniques” 
(published in 1954), entitled “Technical Recommendations for Achievement Tests.” 

6. Subsequently, a joint committee of the three organizations modified, revised and 
consolidated the two documents into the first Joint Standards. Beginning with the 1966 revision, 
the Sponsoring Organizations collaborated in developing the “Joint Standards” (or simply, the 
“Standards”). Each subsequent revision of the Standards has been careful to note that it is a 
revision and update of that document. 

7. Beginning in the mid-1950s, the Sponsoring Organizations formed and 
periodically reconstituted a committee of experts in psychological and educational assessment, 
charged with the initial development of the Technical Recommendations and then each 
subsequent revision of the (renamed) Standards. These committees were formed by the three 
organizations' Presidents (or their designees), who would meet and jointly agree on the 
membership. Often a chair or co-chairs of these committees were selected by joint agreement. 
Beginning with the 1966 version of the Standards, this committee became referred to as the 
“Joint Committee.” For example, I was the co-chair of the Joint Committee for the 2014 edition 
of the Standards. 

8. Financial and operational oversight for the Standards’ revisions, promotion, 
distribution, and for the sale of the 1999 and 2014 Standards has been undertaken by a 
periodically reconstituted Management Committee, comprised of designees of the three 
Sponsoring Organizations. 

9. All members of the Joint Committee(s) and the Management Committee(s) are 
unpaid volunteers. The expenses associated with the ongoing development and publication of 
the Standards include travel and lodging expenses (for the Joint Committee and Management 
Committee members), support staff time, printing and shipment of bound volumes, and 
advertising costs. 

10. Many different fields of endeavor rely on assessments. The Sponsoring 
Organizations have ensured that the range of these fields of endeavor is represented in the Joint 
Committee’s membership - e.g., admissions, achievement, clinical counseling, educational, 
licensing-credentialing, employment, policy, and program evaluation. Similarly, the Joint 
Committee’s members represent expertise across major functional assessment areas - e.g., 
validity, equating, reliability, test development, scoring, reporting, interpretation, large scale 
interpolation and cognitive behavioral therapy. 

11. From the time of their initial creation to the present, the preparation and periodic 
revisions to the Standards entail intensive labor and considerable cross-disciplinary expertise. 
Each time the Standards are revised, the Sponsoring Organizations select and arrange for 
meetings of the leading authorities in psychological and educational assessments (known as the 
Joint Committee). During these meetings, certain Standards are combined, pared down, and/or 
augmented, others are deleted altogether, and some are created as whole new individual 
Standards. The 1999 version of the Standards is nearly 200 pages, and took more than five years 
to complete - resulting from work put in by the Joint Committee to generate a set of best 
practices on educational and psychological testing that are respected and relied upon by leaders 
in their fields. 

12. The Standards originally were created as principles and guidelines - a set of best 
practices to improve professional practice in testing and assessment across multiple settings, 
including education and various areas of psychology. The Standards can and should be used as a 
recommended course of action in the sound and ethical development and use of tests, and also to 
evaluate the quality of tests and testing practices. Additionally, an essential component of 
responsible professional practice is maintaining technical competence. Many professional 
associations also have developed standards and principles of technical practice in assessment. 
The Sponsoring Organizations’ Standards have been and still are used for this purpose. 

13. The Standards, however, are not simply intended for members of the Sponsoring 
Organizations, AERA, APA, and NCME. The intended audience of the Standards is broad and 
cuts across audiences with varying backgrounds and different training. For example, the 
Standards also are intended to guide test developers, sponsors, publishers, and users by providing 
criteria for the evaluation of tests, testing practices, and the effects of test use. Test user 
standards refer to those standards that help test users decide how to choose certain tests, interpret 
scores, or make decisions based on tests results. Test users include clinical or industrial 
psychologists, research directors, school psychologists, counselors, employment supervisors, 
teachers, and various administrators who select or interpret tests for their organizations. There is 
no mechanism, however, to enforce compliance with the Standards on the part of the test 
developer or test user. The Standards, moreover, do not attempt to provide psychometric 
answers to policy or legal questions. 

14. The Standards promote the development of high quality tests and the sound use of 
results from such tests. Without such high quality standards, tests might produce scores that are 
not defensible or accurate, not an adequate reflection of the characteristic they were intended to 
measure, and not fair to the person tested. Consequently, decisions about individuals made with 
such test scores would be no better, or even worse, than those made with no test score 
information at all. Thus, the Standards help to ensure that measures of student achievement are 
relevant, that admissions decisions are fair, that employment hiring and professional 
credentialing result in qualified individuals being selected, and patients with psychological needs 
are diagnosed properly and treated accordingly. Quality tests protect the public from harmful 
decision making and provide opportunities for education and employment that are fair to all who 
seek them. 

15. The Standards apply broadly to a wide range of standardized instruments and 
procedures that sample an individual’s behavior, including tests, assessments, inventories, scales, 
and other testing vehicles. The Standards apply equally to standardized multiple-choice tests, 
performance assessments (including tests comprised of only open-ended essays), and hands-on 
assessments or simulations. The main exceptions are that the Standards do not apply to 
unstandardized questionnaires (e.g., unstructured behavioral checklists or observational forms), 
teacher-made tests, and subjective decision processes (e.g., a teacher’s evaluation of students’ 
classroom participation over the course of a semester). 

16. The Standards have been used to develop testing guidelines for such activities as 
college admissions, personnel selection, test translations, test user qualifications, and computer- 
based testing. The Standards also have been widely cited to address technical, professional, and 
operational norms for all forms of assessments that are professionally developed and used in a 
variety of settings. The Standards additionally provide a valuable public service to state and 
federal governments as they voluntarily choose to use them. For instance, each testing company, 
when submitting proposals for testing administration, instead of relying on a patchwork of local, 
or even individual and proprietary, testing design and implementation criteria, may rely instead 
on the Sponsoring Organizations’ Standards to afford the best guidance for testing and 
assessment practices. 

17. The Standards were not created or updated to serve as a legally binding document, 
in response to an expressed governmental or regulatory need, nor in response to any legislative 
action or judicial decision. However, the Standards have been cited in judicial decisions related 
to the proper use and evidence for assessment, as well as by state and federal legislators. These 
citations in judicial decisions and during legislative deliberations occurred without any lobbying 
by the Plaintiffs. 

18. NCME has never communicated with Congress for the purpose of encouraging 
the enactment of the Standards into law. 

19. Additionally, NCME has never solicited any government agency to incorporate 
the Standards into the Code of Federal Regulations or other rules of Federal or State agencies. 

20. In the policymaking arena, NCME believes the Standards should be treated as 
guidelines informing the enactment of legislation and regulations consistent with best practices 
in the development and use of tests - to insure that they are valid, reliable and fair. 

21. The Sponsoring Organizations promote and sell copies of the Standards via 
referrals to the AERA website, at annual meetings, in public offerings to students, and to 
educational institution faculty. Advertisements promoting the Standards have appeared in 
meeting brochures, in scholarly journals, and in the hallways at professional meetings. 
Accompanying this Declaration as Exhibit KKK is a true copy of advertisements for the 1999 
Standards published in NCME’s Journal of Educational Management. These advertisements 
were produced at Bates Nos. AERAAPANCME 0031444-0031451. 

22. Distribution of the Standards is closely monitored by the Sponsoring 
Organizations. AERA, the designated publisher of the Standards, sometimes does provide 
promotional complimentary print copies to students or professors. Except for these few 
complimentary print copies, however, the Standards are not given away for free; and certainly 
they are not made available to the public by any of the three organizations for anyone to copy 
free of charge. 

23. To date, NCME has never posted, or authorized the posting of, a digitized copy of 
the 1999 Standards on any publicly accessible website. 

24. Without receiving at least some moderate income from the sales of the Standards 
to offset their production costs and to allow for further revisions, it is very likely that the 
Sponsoring Organizations would no longer undertake to periodically update them, and it is 
unknown who else would. 

25. In late 2013 and early 2014, the Sponsoring Organizations became aware that the 
1999 Standards had been posted on the Internet without their authorization, and that students 
were obtaining free copies from the posting source. Upon further investigation, the Sponsoring 
Organizations discovered that Public Resource was the source of the online posting. 

26. Public Resource posted Plaintiffs’ 1999 Standards to its website and the Internet 
Archive website without the permission or authorization of any of the Sponsoring Organizations. 

27. Plaintiffs have been made aware that at least some of those users who obtained 
the 1999 Standards for free from Public Resource did so to avoid paying the modest sale price 
for authorized print copies. 


Case 1:14-cv-00857-TSC Document 60-73 Filed 12/21/15 Page 8 of 26 
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 221 of 517 

28. Accompanying this Declaration as Exhibit LLL is a true copy of an e-mail dated 
March 5, 2014 from Gregory J. Cizek to me regarding a student not purchasing the 1999 
Standards because “they [were] available for free online” at 
https://law.resource.org/pub/us/cfr/ibr/001/aera.standards.1999.pdf. This e-mail exchange was 
marked as Exhibit 1252 during my deposition. 

I DECLARE, under the penalty of perjury, that the foregoing is true and correct. 


Dated: December , 2015 

EXHIBIT 1 


LAURESS L. WISE 

Curriculum Vitae 


OVERVIEW 


Dr. Lauress Wise has over 35 years’ experience in educational research and continues 
extensive work on educational policy and assessment issues. Dr. Wise currently advises 
several states and the PARCC assessment consortium on technical issues in test development 
and use. He serves on the Board of the National Council of Measurement in Education as the 
immediate past-president. He is also serving on a National Research Council Committee that is 
evaluating the NAEP achievement levels. He recently co-chaired the panel that revised the 
AERA/APA/NCME Standards for Educational and Psychological Testing and previously chaired 
the National Academy of Sciences Board on Testing and Assessment. Recent research and 
development efforts include a 15-year independent evaluation of the California High School Exit 
Exam and quality assurance work for the National Assessment of Educational Progress 
(NAEP). Dr. Wise previously served on several National Research Council committees, 
chairing the Committees on Scientific Research in Education and the Evaluation of the National 
Voluntary Tests. 


EDUCATION 


Ph.D., Mathematical Psychology 1975 University of California, Berkeley 

B.S., Mathematics, Psychology (with Distinction) 1967 Stanford University 


AREAS OF EXPERTISE 


Test Development and Validation 
Program and Policy Evaluation 
Test Use Policy 


Project Management 

Statistical and Psychometric Issues 

Computer-Based Testing 


PROFESSIONAL EXPERIENCE 


Human Resources Research Organization 1994 - 2015 

Principal Scientist 

• Served as HumRRO’s president from 1994 to 2007. Remained active in research on testing 
and test use policy. Directed two major HumRRO educational testing projects, one to 
provide quality assurance for the National Assessment of Educational Progress (NAEP) and 
the other an independent evaluation of California’s High School Exit Exam (CAHSEE). He 
continues to serve as a senior psychometric advisor for a graduate school admissions 
testing program. 

• Served as co-chair of the committee that revised the 1999 AERA/APA/NCME Standards for 
Educational and Psychological Testing, and has previously served as Chair of the National 
Academy of Science (NAS) Board on Testing and Assessment and chaired the NAS 
Committee on Research in Education. 

• Currently serves on technical advisory committees for the Hawaii, Wyoming, Utah, 
Tennessee, and Virginia departments of education, and the Partnership for Assessing 
Readiness for College and Career (PARCC) advisory committees. Also serves on the 
Rhode Island Technical Advisory Committee for Teacher Evaluation. 

• Served as co-Principal Investigator on the first year of the Congressionally-mandated 
evaluation of President Clinton’s Voluntary National Tests and chaired the NAS committee 
that performed the second year of that evaluation, and on the NAS committee to evaluate 
the NAEP and on the National Academy of Education's Panel for the Evaluation of the 
NAEP Trial State NAEP. 

• Other work includes vertical alignment of state content standards, modeling the effects of 
motivation on examinee performance on low-stakes assessments, the impact of changes in 
exclusions on NAEP results for Kentucky, and scaling constructed response and multiple 
choice items on the Florida assessment. Dr. Wise also worked on the development and 
validation of a computer-administered assessment now used for selection of air traffic 
controllers and developed a computer-based system for assessing work values as part of a 
Department of Labor effort to develop improved career guidance tools. 

Defense Manpower Data Center 1990 - 1994 

Chief, Personnel Testing Division 

• Spokesperson for the Department of Defense on matters relating to the development and 
use of cognitive tests. Dr. Wise’s unit was responsible for all research and development for 
the Armed Services Vocational Aptitude Battery (ASVAB). 

• Work included evaluation and implementation of a computerized, adaptive version of the 
ASVAB, automated item and form development procedures, new career exploration 
procedures for use with the high school testing program, the development and testing of a 
new career interest inventory, and extensive validity research. 

American Institutes for Research (AIR) 1974 - 1990 

Associate Research Scientist to Director of Research 

• Directed a variety of studies and projects, including the Review and Analysis of the General 
Aptitude Test Battery (GATB) Project for the U.S. Employment Service within the 
Department of Labor, the Army Synthetic Validation Project, the Army's Computerized 
Adaptive Screening Test (CAST) Revision Project, and validation studies for the Software 
Engineering Institute at Carnegie Mellon University. 

• Served as director of analysis for the Army's massive Project A, analyzing new selection 
tests, developing models of performance in a variety of jobs, and assessing the validity of 
each new test for predicting different facets of performance in different jobs. 

• Served for twelve years as the chief psychometrician for the Medical College Admissions Test, 
developing procedures for the screening and calibration of new items and for the construction 
and equating of new forms. Also consulted with the Department of Education on issues related 
to testing and data analysis as part of the Statistical Analysis Group in Education. 

• From 1978 to 1982, served as Director of Project TALENT, a nationally representative 
longitudinal study of nearly 400,000 members of the high school classes of 1960 through 
1963. Oversaw the collection of the final wave of follow-up data and conducted targeted 
research on issues such as gender differences in mathematics achievement, school 
differences in student achievement, the development of careers in science and medicine, 
and the consequences of adolescent childbearing. 

University of California 

Programmer and Instructor 

• While a graduate student at the University of California, Dr. Wise served as the computer 
consultant for the Psychology Department, helping both faculty and students in the design and 
execution of data analyses. He also taught an undergraduate course in Psychological Statistics. 

California Department of Public Health 1968 - 1972 

Computer Programmer and Data Processing Systems Analyst 

• Created data systems to support licensing functions and vital statistics systems at the 
Department of Public Health. Was in-house project manager for a new management 
information system that involved defining "outputs" for each bureau and department and 
relating these outputs to costs. 


PROFESSIONAL AFFILIATIONS AND SERVICE 


• American Educational Research Association (AERA) 

• American Psychological Association (APA) 

- Divisions 5, 14, 19 

• Member, National Council on Measurement in Education (NCME) 

• Psychometric Society 

SELECTED BIBLIOGRAPHY 


Publications 

Wise, L. L. (in press). How we got to where we are: Evolving policy demands for the next 
generation assessments. In Next Generation Assessments. R. Lissitz, Ed. (To be published in 
2016.) 

Wise, L. L. & Plake, B. S. (2015). "Test design and development following the Standards for 
Educational and Psychological Testing". In Lane, S., Haladyna, T., and Raymond, M. (Eds.). 
Handbook of Test Development. New York, NY: Routledge. 

Plake, B. S., & Wise, L. L. (2014). Revision of the AERA, APA, NCME Standards for 
Educational and Psychological Testing: What is their role and importance for NCME. 
Educational Measurement: Issues and Practice, 33(4), 4-12. 

Wise, L. L. (2010). Accessible Reading Assessments for Students with Disabilities: Summary 
and Conclusions. Applied Measurement in Education, 23(2), 209-214. 

Wise, L. L. (2006). Encouraging and Supporting Compliance with Standards for Educational 
Tests. Educational Measurement: Issues and Practice 25(3), 51-53. 

National Research Council. (2005). Advancing Scientific Research in Education. Committee on 
Research in Education. Lisa Towne, Lauress L. Wise, and Tina M Winders, Editors. Center for 
Education, Division of Behavioral and Social Sciences and Education. Washington, DC: The 
National Academy Press. 

National Research Council. (2004). Strengthening Peer Review in Federal Agencies That 
Support Education Research. Committee on Research in Education. L. Towne, J.M. Fletcher, 
and L.L. Wise, Eds. Center for Education, Division of Behavioral and Social Sciences and 
Education. Washington, DC: The National Academies Press. 

Wise, L.L. (2004). The National Assessment of Educational Progress - what it tells educators. In 
J.E. Wall & G.R. Walz (Eds.). Measuring up: Assessment issues for teachers, counselors, and 
administrators (pp. 729-741). Greensboro, NC: CAPS Press. 

Wise, L.L. & Hoffman, R.G. (2002). How will assessment data to be used to document the 
impact of educational reform. In R. W. Lissitz and W. D. Schafer (Eds.) Assessment in 
Educational Reform: Both means and ends. Boston, MA: Allyn & Bacon. 

Wise, L.L., Noeth, R.J., & Koenig, J.A. (Eds.) (1999). Evaluation of the voluntary national tests, 
year 2 interim report. National Research Council. Washington, DC: National Academy Press. 

Wise, L.L., Hauser, R.M., Mitchell, K.J., & Feuer, M.J. (1999). Evaluation of the voluntary 
national tests: Phase I. National Research Council, Commission on Behavioral and Social 
Sciences and Education, Board on Testing and Assessment. Washington, DC: National 
Academy Press. 

Wise, L.L., Curran, L.T., & McBride, J.R. (1997). CAT-ASVAB cost and benefit analyses. 
In W.A. Sands, B.K. Waters, & J.R. McBride (Eds.), Computerized Adaptive Testing: From 
Inquiry to Operation. Washington, DC: American Psychological Association. 

Wolfe, J.H., Alderton, D.L., Larson, G.E., Bloxom, B., & Wise, L.L. (1997). Expanding the 
content of CAT-ASVAB: New tests and their validity. In W.A. Sands, B.K. Waters, & J.R. 
McBride (Eds.), Computerized Adaptive Testing: From Inquiry to Operation. Washington, DC: 
American Psychological Association. 

Wall, J.E., Wise, L.L., & Baker, H.E. (1996). Development of the interest-finder: A new RIASEC- 
based interest inventory. Measurement and Evaluation in Counseling and Development, 29, 
134-152. 

Wise, L.L. (1994). Goals of the selection and classification decision. In M.G. Rumsey, C.B. 
Walker, & J.H. Harris (Eds.). Personnel selection and classification. Hillsdale, NJ: Lawrence 
Erlbaum Associates. 

Wise, L.L. (1994). Setting performance goals for the DOD Linkage Model. In B.F. Green & A.S. 
Mavor (Eds.) Modeling cost and performance for military enlistment. Washington, DC: National 
Academy of Sciences Press. 

Rudner, L.M., Wise, L.L., & Stonehill, R.M. (1991). The ERIC Clearinghouse on Tests, 
Measurement, and Evaluation (ERIC/TME) - A growing resource. Applied Measurement in 
Education, 4, 1-10. 

Wise, L.L. (1991). The validity of test scores for selecting and classifying enlisted recruits. In 
B.R. Gifford & L.C. Wing (Eds.), Test policy in Defense: Lessons from the military for education, 
training, and employment. Boston, MA: Kluwer Academic Press. 

Campbell, J.P., McHenry, J.J., & Wise, L.L. (1990). Modeling job performance in a population of 
jobs. Personnel Psychology, 43, 313-334. 

Young, W.Y., Houston, J.S., Harris, J.H., Hoffman, R.G., & Wise, L.L. (1990). Large-scale 
predictor validation in Project A: Data collection procedures and data base preparation. 
Personnel Psychology, 43, 301-312. 

Wise, L.L., McHenry, J.J., & Campbell, J.P. (1990). Identifying optimal predictor composites and 
testing for generalizability across jobs and performance factors. Personnel Psychology, 43, 355- 
366. 

Wise, L.L., Campbell, J.P., & Arabian, J.M. (1988). The Army Synthetic Validation Project. In 
B.F. Green, H. Wing, & A.K. Wigdor (Eds.) Linking military enlistment standards to job 
performance. Washington, DC: National Academy Press. 

Wise, L.L. (1985). Project TALENT: Mathematics course participation in the 1960's and its 
career consequences. In S.F. Chipman, L.R. Brush, & D.M. Wilson (Eds.), Women and 
mathematics: Balancing the equation. Hillsdale, NJ: Lawrence Erlbaum Associates. 

Abeles, R.P., Steel, L.M., & Wise, L.L. (1980). Patterns and implications of life-course 
organization: Project TALENT studies. In P.B. Baltes & O.G. Brim, Jr. (Eds.), Lifespan 
development and behavior (Vol. III). New York: Academic Press. 

Wise, L.L., & Steel, L.M. (1980). Educational attainment of the high school classes of 1960 
through 1963: Findings from Project TALENT. In A.C. Kerckhoff (Ed.), Longitudinal 
perspectives on educational attainment. Greenwich, CT: JAI Press. 

Wise, L.L. (1979). Project TALENT: Studying the development of our human resource. In J.E. 
Milholland (Ed.), New directions for testing and measurement: Insights from large-scale 
surveys. San Francisco: Jossey-Bass Inc. 

Thacker, A. A., Dickinson, E. R., Bynum, B. H., Wen, Y., Smith, E. A., Sinclair, A. L., Deatz, R. 
variables memorandum (2014 No. 005). Alexandria, VA: Human Resources Research Organization. 

2002 data replication - final design and management plan (FR-03-58). Alexandria, VA: Human 
Resources Research Organization. 

Wise, L.L., Sellman, S., & Sipes, S. (2003). Notes from the Meeting of the National Assessment 
Governing Board May 15-17, 2003 (SR-03-32). Alexandria, VA: Human Resources Research 
Organization. 

Hoffman, R.G., Becker, D.E., & Wise, L.L. (2003). NAEP quality assurance checks of the 2002 
reading assessment results for Delaware (FR-03-25). Alexandria, VA: Human Resources 
Research Organization. 

Hoffman, R.G., Wise, L.L., & Sticha, P. J. (2003). Review of NAEP quality control plans (TR-03- 
07). Alexandria, VA: Human Resources Research Organization. 

Hoffman, R. G. & Wise, L. L. (2003). The accuracy of school classifications for the 2002 
accountability cycle of the Kentucky Commonwealth Accountability Testing System (FR-03-06). 
Alexandria, VA: Human Resources Research Organization. 

Wise, L.L., Becker, D.E., & Ramsberger, P.F. (2003). Report on past NAEP problems (FR-03- 
03). Alexandria, VA: Human Resources Research Organization. 

Wise, L.L., Sipes, D.E., Harris, C.D., Ford, J.P., Sun, S., Dunn, J., & Goldberg, G.L. (2002). 
Independent evaluation of the California High School Exit Examination (CAHSEE): Year 3 
evaluation report (IR-02-28). Alexandria, VA: Human Resources Research Organization. 

Wise, L.L., Sipes, D.E., Harris, C.D., George, C.E., Ford, J.P., & Sun, S. (2002). Independent 
evaluation of the California High School Exit Examination (CAHSEE): Analysis of the 2001 
administration (FR-02-02). Alexandria, VA: Human Resources Research Organization. 

Wise, L.L., Sipes, D.E. (Sunny), George, C.E., Ford, J. P., & Harris, C.D. (2001). California High 
School Exit Examination (CAHSEE): Year 2 evaluation report (IR-01-29). Alexandria, VA: 
Human Resources Research Organization. 

Hoffman, R.G., & Wise, L.L. (2001). The accuracy of school classifications for the interim 
accountability cycle of the Kentucky Commonwealth Accountability and Testing System. (FR- 
01-26). Alexandria, VA: Human Resources Research Organization. 

McBride, J.R., Paddock, A.F., Wise, L.L., Strickland, W.J., & Waters, B.K. (2001). Testing via 
the internet: A literature review and analysis of issues for Department of Defense internet testing 
of the Armed Services Vocational Aptitude Battery (ASVAB) in the high schools (FR-01-12). 
Alexandria, VA: Human Resources Research Organization. 

Hoffman, R.G., & Wise, L.L. (2000). The accuracy of students’ novice, apprentice, proficient, and 
distinguished classifications of the Kentucky Core Content Test (FR-00-25). Alexandria, VA: 
Human Resources Research Organization. 

Hoffman, R.G., & Wise, L.L. (2000). School classification accuracy final analysis plan for the 
commonwealth accountability and testing system (FR-00-26). Alexandria, VA: Human Resources 
Research Organization. 

Hoffman, R.G., Thacker, A., & Wise, L.L. (2000). The accuracy of students’ novice, apprentice, 
proficient, and distinguished classifications for the 2000 Kentucky Core Content Test (FR-00-41). 
Alexandria, VA: Human Resources Research Organization. 

Wise, L.L., Harris, C.D., Sipes, D.E., Hoffman, R.G., & Ford, J.P. (2000). High school exit 
examination (HSEE): Year 1 evaluation report (IR-00-27r). Alexandria, VA: Human Resources 
Research Organization. 

Wise, L.L., Harris, C.D., Sipes, D.E., Collins, M.M., Hoffman, R.G., & Ford, J.P. (2000). High 
school exit examination (HSEE): Supplemental year 1 evaluation report (IR-00-37). Alexandria, 
VA: Human Resources Research Organization. 

Hoffman, R.G., & Wise, L.L. (1999). Establishing the reliability of student level classifications: 
Analytic plan and demonstration (FR-WATSD-99-34). Alexandria, VA: Human Resources 
Research Organization. 

Wise, L.L., Hoffman, R.G., & Thacker, A.A. (August 1999). Evaluation of calibration and 
equating procedures for the Florida State Assessment. (FR-WATSD-99-41). Alexandria, VA: 
Human Resources Research Organization. 

Wise, L.L., McCloy, R.A., & Quartetti, D.A. (1998). Indicators of student effort on the National 
Assessment of Educational Progress (DFR-EADD-98-58). Alexandria, VA: Human Resources 
Research Organization. 

McCloy, R.A., Russell, T.L., & Wise, L.L. (Eds.). (1997). General Aptitude Test Battery (GATB) 
improvement project final report. Washington, DC: U.S. Department of Labor, Divisions of Skills 
Assessment and Analysis, Office of Policy and Research, Employment and Training 
Administration. 

Wise, L.L. (1997, June). Merging ASVAB and KIRIS on-demand scores: Report of preliminary 
results (LRS 97-4). Frankfort, KY: Kentucky Department of Education. 

Wise, L.L., Welsh, J., Grafton, F., Foley, P., Earles, J., Sawin, L., & Divgi, D.R. (1992). Sensitivity 
and fairness of the Armed Services Vocational Aptitude Battery (ASVAB) technical 
composites. Monterey, CA: Defense Manpower Data Center. 

Wise, L.L., Chia, W.J., & Rudner, L.M. (1990). Identifying necessary job skills: A review of 
previous approaches. Washington, DC: Pelavin Associates, Inc. 

Wise, L.L., Peterson, N.G., Hoffman, R.G., Campbell, J.P., & Arabian, J.M. (1990). Army Synthetic 
Validity Project: Report of phase III results. Washington, DC: American Institutes for Research. 

Wise, L.L., McHenry, J.J., Chia, W.J., Szenas, P.L., & McBride, J.R. (1989). Refinement of the 
Computer Adaptive Screening Test (CAST). Alexandria, VA: U.S. Army Research Institute for 
the Behavioral and Social Sciences. 

Wise, L.L., Hough, L.M., Szenas, P.L., & Keyes, M.A. (1988). Phase I report: armed services 
applicant profile (ASAP) item fairness analysis. Washington, DC: American Institutes for Research. 

Wise, L.L., McHenry, J.J., & Young, W.Y. (1986). Project A concurrent validation: treatment of 
missing data (RS-WP-86-08). Alexandria, VA: U.S. Army Research Institute for the Behavioral 
and Social Sciences. 

McLaughlin, D.H., Rossmeissl, P.G., Wise, L.L., Brandt, D.A., & Wang, M. (1984). Validation of 
current and alternative ASVAB area composites based on training and SQT information on FY 
1982 enlisted accessions (Technical Report No. 651). Alexandria, VA: U.S. Army Research 
Institute for the Behavioral and Social Sciences. 

Card, J.J., & Wise, L.L. (1981). Teenage mothers and teenage fathers: The impact of early 
childbearing on the parents' personal and professional lives. In F.F. Furstenberg, R. Lincoln, & 
J. Menken (Eds.), Teenage sexuality, pregnancy, and childbearing. Philadelphia: University of 
Pennsylvania Press. 

Wilson, S.R., Stancavage, F.B., & Wise, L.L. (1981). Synthesis of recent research on medical 
career decisions: A comparative study of two generations of physicians. Palo Alto, CA: 
American Institutes for Research. 

Steel, L.M., & Wise, L.L. (1977). Designing a study of adult accomplishment and life quality. 
Palo Alto, CA: American Institutes for Research. 

Wise, L.L., McLaughlin, D.H., & Steel, L.J. (1977). The Project TALENT data bank handbook. 
Palo Alto, CA: American Institutes for Research. 

Wise, L.L., McLaughlin, D.H., & Gilmartin, K.G. (1977). The American citizen: Eleven years 
after high school, Volume II. Palo Alto, CA: American Institutes for Research. 

Gilmartin, K.G., McLaughlin, D.H., Wise, L.L., & Rossi, R.J. (1976). Development of scientific 
careers: The high school years. Palo Alto, CA: American Institutes for Research. 

Rossi, R.J., Bartlett, W.B., Campbell, E.A., Wise, L.L., & McLaughlin, D.H. (1975). Using the 
TALENT profiles in counseling: A supplement to the career data book. Palo Alto, CA: American 
Institutes for Research. 

Wilson, S.R., & Wise, L.L. (1975). The American citizen: Eleven years after high school. Palo 
Alto, CA: American Institutes for Research. 

Presentations 

Wise, L. L. (2015). The Standards for Educational and Psychological Tests: Implications for 
Peer Review of State Assessments. Workshop for state testing directors at the National 
Conference on Student Assessment, San Diego, CA. 

Wise, L. L. (2015). Psychometric Considerations for the Next Generation of Performance 
Assessment: What are the implications for state assessment programs? Paper presented at the 
National Conference on Student Assessment, San Diego, CA. 

Wise, L. L. (2015). Educational Measurement: What lies ahead? Presidential address to the 
annual meeting of the National Council on Measurement in Education. Chicago, IL. 

Wise, L. L. (2015). Psychometric Considerations for the Next Generation of Performance 
Assessment: Modeling, Dimensionality, and Weighting of Performance Task Scores. Paper 
presented at the annual meeting of the National Council on Measurement in Education, 
Chicago, IL. 

Wise, L. L. (2015). Test Design and Development Following the Standards for Educational and 
Psychological Testing. Presentation to the annual meeting of the National Council on 
Measurement in Education. Chicago, IL. 

Wise, L.L. (2013). Different but Comparable: Good enough for government work? Paper 
presented at the annual meeting of the National Council on Measurement in Education. San 
Francisco, CA. 

Wise, L.L. (2012). Combining multiple indicators of achievement and growth. Paper presented 
at the annual meeting of the National Council on Measurement in Education. Vancouver, BC. 

Wise, L.L. (2012). Prior linking efforts: The best laid plans .... Paper presented at the annual 
meeting of the National Council on Measurement in Education. Vancouver, BC. 

Wise, L.L. (2012). Revising our Standards for Educational and Psychological Testing. Paper 
presented at the Association of Test Publishers’ Innovations in Test Design Conference. Palm 
Springs, CA. 

Wise, L. (April 2011). Situating the Generalizability of Performance Assessments within a Validity 
Framework. Paper presented at the 2011 Annual Meeting of the National Council on 
Measurement in Education, New Orleans, LA. 

Wise, L. (April 2011). Aggregating Results from Through-Course Assessments. Paper 
presented at the 2011 Annual Meeting of the National Council on Measurement in Education, 
New Orleans, LA. 

Wise, L. (April 2011). Aggregating Results from Through-Course Assessments. Paper 
presented at the National Conference on Student Assessment, Orlando, FL. 

Wise, L. (April 2011). The Evolving U.S. Educational System: How Can I-O Psychology 
Contribute? Paper presented at the annual meeting of the Society for Industrial and 
Organizational Psychology, Chicago, IL. 

Wise, L. (June 2011). Update on Revision of the Standards for Educational and Psychological 
Testing. Paper presented at the National Conference on Student Assessment, Orlando, FL. 

Wise, L. (August 2011). Update on Revision of the Standards for Educational and Psychological 
Testing. Paper presented at the annual convention of the American Psychological Association, 
Washington, DC. 

Wise, L.L. (2009) Revising our Test Standards: Issues with Increased Use of Tests for 
Accountability. Presentation at the 2009 annual meeting of the American Psychological 
Association, Toronto, Canada. 

Wise, L.L. (2009, June) Revising our Test Standards. Presentation to the CCSSO National 
Conference on Student Assessment, Los Angeles, CA. 

Wise, L.L. (2009, April) Revising our Test Standards: Issues for Work Place Testing. 
Presentation at the 2009 annual meeting of the American Educational Research Association, 
San Diego, CA. 

Wise, L.L. (2008). Validating Indicators of College Preparedness: Ready or Not? Presentation 
to the 2008 Reidy Interactive Lecture Series. Portsmouth, NH. 

Wise, L.L. (September 2008). State Assessments Today: What State are We In? Presentation 
to the Conference on Educational Testing in America: State Assessment Achievement Gaps, 
Federal Policy and Innovations. Washington, DC. 

Wise, L.L. (June 2008). Strengthening K-12 Accountability Systems: What could be better than 
peer review? Presentation to the CCSSO National Conference on Student Assessment, 
Orlando, FL. 

Wise, L.L., & Plake, B. (June 2008). Procedures for Revising the Test Standards. Presentation 
to the CCSSO National Conference on Student Assessment, Orlando, FL. 

Wise, L.L., & Rui, N. (2008, March). Computing and Communicating Test Accuracy for High- 
Stakes Decisions. Paper presentation at the 2008 annual meeting of the National Council on 
Measurement in Education, New York City, NY. 

Wise, L.L. (2007). Vertical Alignment. Paper presented at the CCSSO Large Scale Assessment 
Conference, Nashville, TN. 

Shen, X. (Stanford University; former HumRRO intern), Wise, L.L., and Becker, D.E. (2006, 
April). Analysis of School Effects. In L. Roberts (Chair), Evaluating the Impact of a High-School 
Graduation Test. Symposium conducted at the annual conference of the American Educational 
Research Association, San Francisco. 

Wang, X., Wise, L.L., Becker, D.E., and Taylor, L.R. (2006, April). Comparison of CAHSEE 
Content Standards to Graduation and High-School Accountability Tests Used in Other States. In 
L. Roberts (Chair), Evaluating the Impact of a High-School Graduation Test. Symposium 
conducted at the annual conference of the American Educational Research Association, San 
Francisco. 

Wise, L. (Discussant) (2006, May). Expanding our influence: How I-O psychologists can 
improve education. Practice Forum conducted at the 21st Annual Conference of the Society for 
Industrial and Organizational Psychology, Dallas, TX. 

Wise, L.L. and Becker, D.E. (2006, April). Analysis of Results from Administrations of the 
California High School Exit Exam (CAHSEE). In L. Roberts (Chair), Evaluating the Impact of a 
High-School Graduation Test. Symposium conducted at the annual conference of the American 
Educational Research Association, San Francisco. 

Wise, L. W. (2004). Independent evaluation of the California High School Exit Exam (CAHSEE). 
Presentation at the American Educational Research Association Annual Meeting, San Diego, CA. 

Wise, L. W. (2004). Improving scientific research in education: Recent activities of the National 
Research Council. Presidentially Invited Symposium conducted at the American Educational 
Research Association Annual Meeting, San Diego, CA. 

Wise, Lauress L. (2004). Vertically articulated content standards. Presentation for the Reidy 
Interactive Lecture Series. Nashua, NH. 

Wise, Lauress L. (2004). Meeting Alignment challenges: Analyzing vertical alignment. Paper 
presented at the CCSSO Large Scale Assessment Conference. Boston, MA. 

Wise, Lauress L. (2004). Independent Evaluation of the CAHSEE: Update on Evaluation 
Findings and Recommendations. Presentation to the California State Board of Education, 
Sacramento, CA. 

Wise, Lauress L. (2004). Debra P. v. Turlington and CAHSEE: Those who do not study history 
are doomed to repeat it: Presidential Invited Session. Annual Meeting of the American 
Educational Research Association. San Diego, CA. 

Wise, L. L., Floden, R.E., Dickersin, K. & Schneider, B.L. (2004). Improving Scientific Research 
in Education: Recent Activities of the National Research Council. Presidential Invited Session. 
Annual Meeting of the American Educational Research Association. San Diego, CA. 

Wise, L.L. (2001). Building a selection test battery: Validity, fairness, and other tradeoffs. 
Presentation to the Personnel Testing Council/Metropolitan Washington. Washington, DC: 
National Research Council. 

Wise, L.L. (2001). Validity, Fairness, and other Tradeoffs. Presentation to the Personnel Testing 
Council/Metropolitan Washington. Washington, DC: National Research Council. 

Wise, L.L. (1999). How far should NAEP go in serving an interpretive function? Presentation to 
the Forum on NAEP Design: 2000-2010. Washington, DC: National Research Council. 

Wise, L.L. & Hauser, R.M. (1999). Evaluation of the Voluntary National Tests. Paper presented 
at the annual meeting of the American Educational Research Association, Montreal, Canada. 

Waugh, G.W., Wise, L.L., Quartetti, D.A., & Ramos, R.A. (1999). Validation of the air traffic 
controller predictor tests. In R.A. Ramos, (Chair), Air traffic selection and training project. 
Symposium conducted at the 14th Annual Conference of the Society for Industrial and 
Organizational Psychology, Inc., Atlanta, GA. 

Wise, L.L., Quartetti, D.A. (HumRRO), Kieckhaefer, W.F. (RGI), & Houston, J.S. (PDRI). (1999). 
Development of air traffic controller predictor battery. In R.A. Ramos (Chair), Air traffic selection 
and training project. Symposium conducted at the 14th Annual Conference of the Society for 
Industrial and Organizational Psychology, Inc., Atlanta, GA. 

Wise, L.L. (1999, April). How far should NAEP go in serving an interpretive function? 
Presentation to the Forum on NAEP Design: 2000-2010. Washington, DC: National Research 
Council. 

Wise, L.W. (1999). Test-taking motivation as persistence: Its effect on item and test 
performance. Paper presented at the Festschrift for William Meredith, Berkeley, CA. 

Wise, L.L., Quartetti, D.A., Kieckhaefer, W.F., & Houston, J.S. (1999). Development of air traffic 
controller predictor battery. In R.A. Ramos (Chair), Air traffic selection and training project. 
Symposium conducted at the 14th Annual Conference of the Society for Industrial and 
Organizational Psychology, Inc., Atlanta, GA. 

Wise, L.L. (1998, April). Relationship of Kentucky High School Assessment Scores to 
Predictors. In R.G. Hoffman (Chair), Symposium conducted at the Annual Meeting of the 
National Council on Measurement in Education, San Diego, CA. 

Wise, L.L. (1998, April). Discussant in Statistical support for CAT implementation. Paper 
Session at the Annual Meeting of the National Council on Measurement in Education, San 
Diego, CA. 

Wise, L.L. (1997). Career directions beyond the dissertation: R&D "think tanks." Presenter at 
invited symposium, Career directions in measurement: Beyond the dissertation, sponsored by 
NCME Graduate Student Issues Committee and co-sponsored by AERA Division D, at the 1997 
National Council on Measurement in Education Annual Meeting, Chicago, IL. 

Wise, L.L. (1996, April). A persistence model of motivation and test performance. Paper presented 
at the annual meeting of the American Educational Research Association, New York, NY. 

Curran L.T., & Wise, L.L. (1994). Enlistment processing changes. Paper presented at the 
annual meeting of the International Military Testing Association, The Netherlands. 

Welsh, J.R., & Wise, L.L. (1994). The stability of the Mantel-Haenszel odds ratio. Paper 
presented at the annual meeting of the International Military Testing Association, The 
Netherlands. 

Wise, L.L., & Welsh, J.R. (November 1993). Order adjustment and cross-correlation in ASVAB 
test form development. Paper presented at the annual meeting of the International Military 
Testing Association, Williamsburg, VA. 

Wise, L.L., & Wall, J.E. (1993, November). Plan for evaluating the ASVAB career exploration 
program. Paper presented at the annual meeting of the International Military Testing 
Association, Williamsburg, VA. 

Wise, L.L., & Curran, L.T. (1993, August). Introducing the new ASVAB: Recommendations and 
decisions for change. Paper presented at the annual meeting of the American Psychological 
Association, Toronto, Canada. 

Wise, L.L. (1993, April). Scoring rubrics for performance tests: Lessons learned from job 
performance assessment in the military. Paper presented at the annual meeting of the National 
Council on Measurement in Education, Atlanta, GA. 

Wise, L.L. (1993, April). Test form accuracy. Paper presented at the annual meeting of the 
National Council on Measurement in Education, Atlanta, GA. 

Wise, L.L. (1992, April). Lessons learned from military performance assessment. Paper 
presented at the annual meeting of the National Council on Measurement in Education, San 
Francisco, CA. 

Wise, L.L. (1991, October). Overview of the ASVAB revision process. Paper presented at the 
annual meeting of the Military Testing Association, San Antonio, TX. 


Wise, L.L., & McDaniel, M.A. (1991). Cognitive factors in the Armed Services Vocational 
Aptitude Battery and the General Aptitude Test Battery. Paper presented at the annual meeting 
of the American Psychological Association, San Francisco, CA. 

Wise, L.L., Chia, W.J., & Park, R.K. (1989). Effects of item position on IRT parameter estimates and item statistics. Paper 
presented at the annual meeting of the American Educational Research Association, San 
Francisco, CA. 

Arabian, J.M., Wise, L.L., & Szenas, P.L. (1988). Setting performance standards for Army 
enlisted jobs. Paper presented at the annual meeting of the American Psychological Association, 
New Orleans, LA. 

Szenas, P.L., Wise, L.L., & Arabian, J.M. (1988). Combining individual standards into an overall 
standard: Modeling the judgement process and investigating differences among judges. Paper 
presented at the annual meeting of the American Psychological Association, New Orleans, LA. 

Wise, L.L., McHenry, J.J., & Campbell, J.P. (1987). Matching skills and traits to job requirements: 
Results from Project A. Paper presented at the annual meeting of the American 
Educational Research Association, Washington, DC. 

Wise, L.L. (1987). Differential item difficulty indicators in small samples. Paper presented at the 
annual meeting of the American Educational Research Association, Washington, DC. 

Wise, L.L., McHenry, J.J., Rossmeissl, P.G., & Oppler, S.H. (1986). ASVAB validities using 
improved job performance measures. Paper presented at the annual meeting of the Military 
Testing Association, Mystic, CT. 

Wise, L.L., Campbell, J.P., McHenry, J.J., & Hanser, L.M. (1986). A latent structure model of job 
performance factors. Paper presented at the annual meeting of the American Psychological 
Association, Washington, DC. 

Wise, L.L. (1986). Latent trait models for partially speeded tests. Paper presented at the annual 
meeting of the American Educational Research Association, San Francisco, CA. 

Wise, L.L., & Mitchell, K.J. (1985). Development of an index of maximum validity increment for 
new predictor measures. Paper presented at the annual meeting of the American Psychological 
Association, Los Angeles, CA. 

Wise, L.L., & Wilson, S.R. (1982). Test item calibration. Paper presented at the annual meeting 
of the American Educational Research Association, New York, NY. 

Wise, L.L., & McLaughlin, D.H. (1981). Survey data enhancement. Paper presented at the 
annual meeting of the American Educational Research Association, Los Angeles, CA. 

Wise, L.L., Wilson, S.R., & Stancavage, F.B. (1980). The development of medical practice and 
residence value scales that distinguish physicians in different specialties and practice locations. 
Paper presented at the annual meeting of the American Educational Research Association, 
Boston, MA. 

Wise, L.L., & Steel, L.M. (1979). The effects of school quality on student's knowledge and skills. 
Paper presented at the annual meeting of the American Educational Research Association, San 
Francisco, CA. 

Wise, L.L. (1979). Long-term consequences of sex differences in high school mathematics 
education. Paper presented at the annual meeting of the American Educational Research 
Association, San Francisco, CA. 

Steel, L.M., & Wise, L.L. (1979). Origins of sex differences in high school mathematics 
achievement and participation. Paper presented at the annual meeting of the American 
Educational Research Association, San Francisco, CA. 

Wise, L.L. (1978). The role of mathematics in women's career development. Paper presented at 
the annual meeting of the American Psychological Association, Toronto, Canada. 

UNITED STATES DISTRICT COURT 
FOR THE DISTRICT OF COLUMBIA 


AMERICAN EDUCATIONAL RESEARCH       ) 
ASSOCIATION, INC., AMERICAN         ) 
PSYCHOLOGICAL ASSOCIATION, INC.,    ) 
and NATIONAL COUNCIL ON             ) 
MEASUREMENT IN EDUCATION, INC.,     )   Civil Action No. 1:14-cv-00857-TSC-DAR 
                                    ) 
          Plaintiffs,               )   DECLARATION OF WAYNE 
                                    )   CAMARA IN SUPPORT OF 
     v.                             )   PLAINTIFFS’ MOTION FOR 
                                    )   SUMMARY JUDGMENT AND ENTRY 
PUBLIC.RESOURCE.ORG, INC.,          )   OF A PERMANENT INJUNCTION 
                                    ) 
          Defendant.                ) 
____________________________________) 

I, WAYNE J. CAMARA, declare: 

1. I am the Senior Vice President, Research at ACT. My company produces and 
publishes the ACT® college readiness assessment — a college admissions and placement test 
taken by millions of high school graduates every year. ACT also offers comprehensive assessment, 
research, information, and program management services to support education and workforce 
development. As the Senior Vice President of Research, I am responsible for all research and 
evidence related to the design, development, use, and validation of our assessments and 
programs. In my position, I serve on the Senior Leadership Team and manage over 110 
researchers. 

2. I submit this Declaration in support of the motion of the American Educational 
Research Association, Inc. (“AERA”), the American Psychological Association, Inc. (“APA”), 
and the National Council on Measurement in Education, Inc. (“NCME”) (collectively, 
“Plaintiffs” or “Sponsoring Organizations”) for summary judgment and the entry of a permanent 
injunction. 

3. Prior to working at ACT, I worked at The College Board, where I held the 
positions of Vice President, Research and Development (July, 2000 - September, 2013), 
Executive Director, Office of Research and Development (March, 1997 - June, 2000), and 
Research Scientist (September, 1994 - February, 1997). 

4. Before working at The College Board, I worked for APA in the positions of 
Assistant Executive Director for Scientific Affairs and Executive Director of Science (1992- 
1994), Director, Scientific Affairs (February, 1989 - August, 1992), and Testing and Assessment 
Officer (November, 1987 - January, 1989). During my time at APA, I also served as the Project 
Director for the revision of the 1985 edition of the Standards for Educational and Psychological 
Testing published in 1999 (the “1999 Standards”). In 1997, I was elected to APA’s Council of 
Representatives, and I served on the Council from 1997-2003. In April, 2012, I was elected to 
the AERA Council, serving from April, 2012 to April, 2015 as Vice President for Division D. I 
was also elected to NCME’s Board of Directors, serving on the Board from 2002-2005 and 
2009-2012, and served as NCME’s President from 2010-2011. Additionally, I have served on 
the Management Committee for the Standards from 2005-2015. 

5. My curriculum vitae is attached to this Declaration as Exhibit 1. 

6. I have written extensively on the Standards, as well as other professional and 
technical guidelines which relate to educational and industrial testing and assessment, including 
journal articles, book chapters, and paper presentations at national conferences. 

7. In 1954, APA prepared and published the “Technical Recommendations for 
Psychological Tests and Diagnostic Techniques.” In 1955, AERA and NCME prepared and 
published a companion document entitled, “Technical Recommendations for Achievement 
Tests.” Subsequently, a joint committee of the three organizations modified, revised, and 
consolidated the two documents into the first Joint Standards. Beginning with the 1966 revision, 
the Sponsoring Organizations collaborated in developing the “Joint Standards” (or simply, the 
“Standards”). Each subsequent revision of the Standards has been careful to note that it is a 
revision and update of the prior version. 

8. Beginning in the mid-1950s, the Sponsoring Organizations formed and 
periodically reconstituted a committee of highly trained and experienced experts in 
psychological and educational assessment, charged with the initial development of the Technical 
Recommendations and then each subsequent revision of the (renamed) Standards. These 
committees were formed by the Sponsoring Organizations’ Presidents (or their designees), who 
would meet and jointly agree on the membership. Often a chair or co-chairs of these committees 
were selected by joint agreement. Beginning with the 1966 version of the Standards, this 
committee became referred to as the “Joint Committee.” 

9. Financial and operational oversight for the Standards’ revisions, promotion, 
distribution, and for the sale of the 1999 and 2014 Standards has been undertaken by a 
periodically reconstituted Management Committee, comprised of designees of the three 
Sponsoring Organizations. As noted above, I served on this Management Committee from
2005-2015.

10. All members of the Joint Committee(s) and the Management Committee(s) are 
unpaid volunteers. The expenses associated with the ongoing development and publication of 
the Standards include travel and lodging expenses (for the Joint Committee and Management 
Committee members), support staff time, printing and shipment of bound volumes, and 
advertising costs. 

11. From the time of their initial creation to the present, the preparation of and 
periodic revisions to the Standards entail intensive labor and considerable cross-disciplinary 
expertise. Each time the Standards are revised, the Sponsoring Organizations select and arrange
for meetings of the leading authorities in psychological and educational assessments (known as 
the Joint Committee). During these meetings, certain Standards are combined, pared down, 
and/or augmented, others are deleted altogether, and some are created as whole new individual 
Standards. The 1999 version of the Standards is nearly 200 pages, took more than five years to 
complete, and is the result of work put in by the Joint Committee to generate a set of best 
practices on educational and psychological testing that are respected and relied upon by leaders 
in their fields. 

12. Draft revisions of the 1985 Standards, for what became the 1999 Standards, were 
widely distributed for public review and comment during the revision process. The Joint 
Committee received thousands of pages of comments and proposed text revisions from: the
membership of the Sponsoring Organizations, scientific, professional, trade and advocacy 
groups, credentialing boards, state and federal government agencies, test publishers and 
developers, and academic institutions. While the Joint Committee reviewed and took under 
advisement these helpful comments, the final language of the 1999 Standards was a product of 
the Joint Committee members. When the 1985 Standards were revised, more than half the 
content of the 1999 Standards resulted from newly written prose of the Joint Committee. 

13. The Standards originally were created as principles and guidelines - a set of best 
practices to improve professional practice in testing and assessment across multiple settings, 
including education and various areas of psychology. The Standards can and should be used as a 
recommended course of action in the sound and ethical development and use of tests, and also to 
evaluate the quality of tests and testing practices. Additionally, an essential component of 
responsible professional practice is maintaining technical competence. Many professional 
associations also have developed standards and principles of technical practice in assessment.
The Sponsoring Organizations’ Standards have been and still are used for this purpose. 

14. The Standards, however, are not simply intended for members of the Sponsoring 
Organizations, AERA, APA, and NCME. The intended audience of the Standards is broad and 
cuts across audiences with varying backgrounds and different training. For example, the 
Standards also are intended to guide test developers, sponsors, publishers, and users by providing 
criteria for the evaluation of tests, testing practices, and the effects of test use. Test user 
standards refer to those standards that help test users decide how to choose certain tests, interpret 
scores, or make decisions based on test results. Test users include clinical or industrial
psychologists, research directors, school psychologists, counselors, employment supervisors, 
teachers, and various administrators who select or interpret tests for their organizations. There is 
no mechanism, however, to enforce compliance with the Standards on the part of the test 
developer or test user. The Standards, moreover, do not attempt to provide psychometric 
answers to policy or legal questions. 

15. The Standards promote the development of high quality tests and the sound use of 
results from such tests. Without such high quality standards, tests might produce scores that are 
not defensible or accurate, not an adequate reflection of the characteristic they were intended to 
measure, and not fair to the person tested. Consequently, decisions about individuals made with 
such test scores would be no better, or even worse, than those made with no test score 
information at all. Thus, the Standards help to ensure that measures of student achievement are 
relevant, that admissions decisions are fair, that employment hiring and professional 
credentialing result in qualified individuals being selected, and that patients with psychological needs
are diagnosed properly and treated accordingly. Quality tests protect the public from harmful 
decision making and provide opportunities for education and employment that are fair to all who
seek them. 

16. The Standards apply broadly to a wide range of standardized instruments and 
procedures that sample an individual’s behavior, including tests, assessments, inventories, scales, 
and other testing vehicles. The Standards apply equally to standardized multiple-choice tests, 
performance assessments (including tests comprised of only open-ended essays), and hands-on 
assessments or simulations. The main exceptions are that the Standards do not apply to 
unstandardized questionnaires (e.g., unstructured behavioral checklists or observational forms),
teacher-made tests, and subjective decision processes (e.g., a teacher’s evaluation of students’ 
classroom participation over the course of a semester). 

17. The Standards have been used as a source in developing testing guidelines for 
such activities as college admissions, personnel selection, test translations, test user 
qualifications, and computer-based testing. The Standards also have been widely cited to 
address technical, professional, and operational norms for all forms of assessments that are
professionally developed and used in a variety of settings. The Standards additionally provide a 
valuable public service to state and federal governments as they voluntarily choose to use them. 
For instance, each testing company, when submitting proposals for testing administration, 
instead of relying on a patchwork of local, or even individual and proprietary, testing design and 
implementation criteria, may rely instead on the Sponsoring Organizations’ Standards to afford 
the best guidance for testing and assessment practices. 

18. The Standards were not created or updated to serve as a legally binding document, 
in response to an expressed governmental or regulatory need, nor in response to any legislative 
action or judicial decision. However, the Standards have been cited in judicial decisions related 
to the proper use and evidence for assessment, as well as by state and federal legislators. These
citations in judicial decisions and during legislative deliberations occurred without any lobbying 
by the Plaintiffs. 

19. The Sponsoring Organizations do not keep any of the revenues generated from the 
sales of the Standards. Rather, the income from these sales is used by the Sponsoring 
Organizations to offset their development and production costs and to generate funds for 
subsequent revisions. This allows the Sponsoring Organizations to develop up-to-date, high 
quality Standards that otherwise would not be developed due to the time and effort that goes into 
producing them. 

20. At one time, funding for the Standards revision process from third party sources 
(e.g., governmental agencies, foundations, other associations interested in testing and assessment
issues, etc.) was raised as a consideration. However, this option was not seriously explored as 
the potential conflicts of interest in doing so left the Sponsoring Organizations to conclude that 
the Standards revisions should be self-funding - that is, from the sale of prior editions of the 
Standards. 

21. In late 2013 and early 2014, the Sponsoring Organizations became aware that the 
1999 Standards had been posted on the Internet without their authorization, and that psychology 
students were obtaining free copies from the posting source. Upon further investigation, the 
Sponsoring Organizations discovered that Public.Resource.Org, Inc. (“Public Resource”) was the 
source of the online posting. Accompanying this Declaration as Exhibit MMM is a true copy of 
a thread of emails exchanged among Laurie Wise, Suzanne Lane, David Frisbie, Jerry Sroufe,
Marianne Ernesto, Barbara Plake, and myself 1 sent between December 16, 2013 and February 4,

1 Laurie Wise is the Immediate Past President of NCME and was serving as President of NCME at the time of the
email; Suzanne Lane is a member of the Standards Management Committee; David Frisbie also is a member of the
2014, discussing Public Resource’s posting of the 1999 Standards on the Internet, and marked as
Exhibit 1185 during my deposition. 

22. Past harm to the Sponsoring Organizations from Public Resource’s activities 
includes a lack of greater funding that otherwise would have been available for the update of the 
Sponsoring Organizations’ Standards from the 1999 to the 2014 versions, due to the reduced 
volume of sales of the 1999 Standards. 

23. Should Public Resource’s infringement be allowed to continue, the harm to the 
Sponsoring Organizations, and public at large who rely on the preparation and administration of 
valid, fair and reliable tests, includes: (i) uncontrolled publication of the 1999 Standards without 
any notice that those guidelines have been replaced by the 2014 Standards; (ii) future 
unquantifiable loss of revenue from sales of authorized copies of the 1999 Standards (with 
proper notice that they are no longer the current version) and the 2014 Standards; and (iii) lack of 
funding for future revisions of the 2014 Standards and beyond. 

24. Due to the small membership size of NCME, and the relatively minor portion of the
membership of AERA and APA who devote their careers to testing and assessment, it is highly 
unlikely that the members of the Sponsoring Organizations will vote for a dues increase to fund 
future Standards revision efforts if Public Resource successfully defends this case and is allowed 
to post the Standards online for the public to download or print for free. As a result, the 
Sponsoring Organizations would likely abandon their practice of periodically updating the 
Standards and there would be an absence of any authoritative and independent source of sound 
guidance relating to the development, use, and evaluation of psychological and educational tests. 


Standards Management Committee; Jerry Sroufe is the Director of Government Relations at AERA; Marianne
Ernesto is the Director, Testing and Assessment, at APA; and Barbara Plake was Laurie Wise’s co-chair of the Joint
Committee for the revision of the 1999 Standards, which ultimately were published in 2014. 

Dated: December 8, 2015


Wayne J. Camara 



Case 1:14-cv-00857-TSC Document 60-76 Filed 12/21/15 Page 10 of 32
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 249 of 517 

EXHIBIT 1 


WAYNE J. CAMARA 


OFFICE: 

ACT 

500 ACT Drive 
Iowa City, IA 52243-0168 
Tel (319) 337-1869 
wayne.camara@act.org 

EDUCATION: 

Ph.D. 1986 University of Illinois at Urbana, Psychology. 

Educational Measurement 

Cognate: Industrial and Organizational Psychology 

C.A.G.S. 1982 Rhode Island College (School Psychology), Providence, R.I.

M.A. 1980 Rhode Island College (Educational Measurement), 

Providence, R.I.

B.A. 1978 University of Massachusetts (Psychology/Education), N. Dartmouth, MA 

PROFESSIONAL EXPERIENCE: 

ACT, Iowa City, IA 

Senior Vice President, Research (September 2013 -) 

Oversees research departments across education and workforce assessments and services related to 
research, psychometrics, data reporting, statistical analysis, policy research, survey development, and 
industrial psychology services (e.g., job profiling). Manage a staff of over 125 professional staff. Serves on 
ACT's strategic leadership team and is responsible for shaping and guiding organizational direction and 
planning, as well as representing the organization with external audiences and stakeholders in areas 
including accountability, research, admissions testing, etc. Member of the Executive Leadership Team and 
business sponsor on a range of technology and development projects. 

The College Board, New York, NY 

Vice President, Research and Development (July, 2000 - September 2013) 

Executive Director, Office of Research and Development (March, 1997 - Present) 

Research Scientist (Sept. 1994 - Feb., 1997) 

Was responsible for all research, standards and alignment services, psychometric and assessment 
development activities at the College Board, including design and implementation of R&D activities that 
support College Board assessment programs (SAT, PSAT/NMSQT, AP, CLEP, Subject Tests, Accuplacer,
SpringBoard, etc.). Managed a staff of approximately 75 professionals across several units and locations: 
Research, Statistics and Psychometrics, Test Development, Analysis and Reporting, and Standards 
Alignment. Responsible for policy research, outreach with state assessment directors, higher educational 
institutions, state Boards of Education and other policy and governance bodies. Coordinate product 
planning and business planning for new assessments and enhancements to current assessments. 
Responsible for several external advisory panels and test development committees. Responsible for 
reporting SAT aggregate results to institutions, reviewing all items and final forms of the SAT, and other 


HOME: 

81 Lewis St., 
Marion, MA 02738 
operational work related to assessment development and delivery. Serves as a spokesperson for the
College Board on technical and assessment policy discussions with the media, policymakers (e.g., 
testimony), institutions and other key stakeholders. Works with states, districts, policy makers and higher 
educational systems, to provide data, analyses and information concerning student achievement and 
college readiness. Directed data release process, guidelines and approvals. 

Project manager for development of the New SAT and represents the College Board on issues of test 
development and research with universities, higher educational associations, states and districts, 
academic associations and other groups. Responsible for hiring / management of vendors and 
academicians to implement research, review test forms and items, and prototype development. 

Specific areas of research include validity of admissions measures, evaluation of educational programs, 
including effects of accommodations and extended time for examinees with disabilities, meta-analysis of SAT
validity, grade inflation trends, and development of new constructs and measures relevant to an expanded 
predictor - criterion space. 

Selected development efforts: (1) Conceived, developed and conducted research resulting in AP Potential,
which has increased access to AP courses by identifying students with potential for success; (2) 2005 SAT
redesign with writing; (3) Implementation of ECD and AP Redesign work in selected courses; (4) Research 
and transition plan to move AP, SAT and PSAT from formula scoring to rights only scoring; (5) Design on AP 
Portfolio and through course pilot; (6) Accuplacer diagnostic tests and replatforming; (7) Plan to migrate 
most research and selected psychometric operational work to the CB from vendors; and (8) Design of 
CLEP-testlet assessment. 

AMERICAN PSYCHOLOGICAL ASSOCIATION, Washington, D.C. 

Assistant Executive Director for Scientific Affairs and Executive Director of Science (1992 - 1994)

Director, Scientific Affairs (February 1989 - Aug., 1992)

Testing and Assessment Officer (Dec. 1987 - Jan. 1989) 

Project Director for the Revision of the Standards for Educational and Psychological Testing and 
Assessment. Managed the technical committee, various technical panels, a financial management 
committee and an executive committee comprised of the Presidents of APA, AERA, and NCME. 

Coordinated and developed all association policies and guidelines in areas of scientific affairs, scientific 
misconduct, research funding, and testing and assessment. Major area of responsibilities in measurement 
and assessment included: (a) monitoring scientific and technical advances; (b) educating policy makers, the 
public, the media, and other professionals (e.g., employers, educators) of the relevance and appropriate 
applications of assessment; (c) developing technical guidance and policy statements that address new and 
emerging areas, reflecting both the scientific and professional consensus in assessment; (d) working 
collaboratively with other professional associations, advocacy groups, and governmental agencies; and (e) 
testimony and advocacy on the efficacy of behavioral science. 

Directed APA involvement in numerous assessment issues at the national level: SCANS, Americans with 
Disabilities Act, national education standards, industry-based skills standards, Civil Rights Act of 1991, 
efficacy of clinical assessment, integrity testing, and test-based accountability initiatives. Assisted in 
developing amicus briefs for Supreme Court, informing policymakers, media, and the public of technical 
advances in assessment (e.g., validation strategies, computer-based and interactive assessments, 
implications of fairness and utility analyses, etc.) and behavioral science research more broadly. 

GEORGE WASHINGTON UNIVERSITY, Washington DC

Adjunct Professor of Administrative Sciences and College of Business (1988 -1994) 

Taught graduate seminars in training, performance evaluation, personnel selection, and organizational 
behavior. Served on several doctoral dissertations in I-O psychology.

HUMAN RESOURCES RESEARCH ORGANIZATION, Alexandria, VA, 

Senior Scientist (Feb. 1987 - November 1987), Research Scientist (August 1985 - Feb. 1987) 

Conducted research and managed grants and proposal development in areas of job analysis, competency 
modeling, military testing, training, and personnel selection. Projects included:

Investigated the utility of algorithms used in computer-based job classification systems employed by 
each branch of the military service. Developed a crosswalk between military occupations in each 
service branch and civilian occupations. 

Project Director and Principal Investigator for contracts funded by the Assistant Secretary of Defense 
and the Navy Personnel Research and Development Center to conduct a longitudinal evaluation of the 
impact of military training and service on subsequent employment/social success of low aptitude 
youth enlisted in the military. 

Developed training and career development system for first-line civilian supervisors in the U.S. Army. 
Provided recommendations for the career development and training of future and incumbent Army 
civilian first-line supervisors. 

Developed training evaluation instruments and conducted evaluation of counselor training in the use 
and interpretation of the ASVAB. 

Managed and conducted several job analysis projects for military and civilian occupations with the 
Department of Defense. 

PERSONNEL SERVICES OFFICE, UNIVERSITY OF ILLINOIS, Champaign, IL 

Human Resources Consultant (1983-85), Illinois State Civil Service 

Designed and managed R&D projects including the development of a computerized adaptive screening 
measure to optimize the matching of jobs and applicants of Civil Service positions. Conducted a large-scale
job analysis of 70 professional and technical job classifications. Used multiple-rater, multiple-method
job analyses and applied generalizability theory to interpret findings. Performed validation studies
of existing civil service exams. 

DEPARTMENT OF PSYCHOLOGY, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN 

Teaching Assistant and Academic Advisor (1983-85). Research Assistant (1984-85). 

COLLEGE OF EDUCATION, UNIVERSITY OF ILLINOIS 

Psychological Evaluator (1983-84). Administered, scored, and interpreted a variety of psychological and 
cognitive measures. 

BRISTOL COMMUNITY COLLEGE, Fall River, MA 

Lecturer (Spring 1980 -1982), Psychology and Education 

WEST BRIDGEWATER PUBLIC SCHOOLS, West Bridgewater, MA

School Psychologist (1979-82) 

Chairperson on team evaluations and reviews. Representative for the school district in out-of-district 
placements, conferences, and regional and state planning meetings. 

Psychological testing - psychodiagnostic and learning assessment - individual IQ tests, projective testing, 
special abilities testing, including two years of clinical supervision. Developed detailed assessment and 
remediation plans for over 150 students. 

PROFESSIONAL AFFILIATIONS: 

American Psychological Association - Elected Fellow, 1994 
Division of Educational Psychology 

Division of Evaluation, Measurement, and Statistics - Elected Fellow, 2002 
Division of General Psychology - Elected Fellow, 1994 
Division of Military Psychology 

Division of International Psychology - Elected Fellow, 2009 
American Psychological Society, Elected Fellow 2007 
American Educational Research Association - Elected Fellow 2008 
International Association of Applied Psychology 
International Test Commission 1997 
National Association of Collegiate Admissions Counselors 
National Council on Measurement in Education
New York Academy of Sciences - Elected, 2002 
Personnel Testing Council of Metropolitan Washington 
Society for Industrial/Organizational Psychology - Elected Fellow, 1999 

RELATED PROFESSIONAL ACTIVITIES AND AWARDS: 

American Educational Research Association, Division D (Measurement and Research Methodology); 
Co-Chair, Annual Convention Program, 1998. 

American Psychological Association: Appointed APA Member of the Joint Committee on Testing Practices, 
1997-2000; Elected to Council of Representatives, 1997-2000; Member of Joint Science and Practice 
Integration Task Force, 1998; Member CODAPAR, 2004-2007. 

Division of Evaluation, Measurement and Statistics, Member-at-Large, 1997-2000; Chair,
Professional Affairs, 1995-96, Member, Program Committee, 1992, 1994. 

Division of Military Psychology: Chair, Program Committee, 1988. 

Division of General Psychology: Chair, Membership Committee, 1996; Member-at-Large, 2002-2005.

Associate Editor, Journal of Occupational Health Psychology, 1996 - 1999; Journal of Experimental 
Psychology - Applied, 2001 - 2007; Advisory Editor, Journal of Educational Measurement 2008 -, 
Educational Measurement: Issues and Practice, 2012 -. 

Audit team, Psychometric and measurement graduate program, University of North Carolina at 
Greensboro, spring, 2006. 

Author of numerous technical and policy statements approved by the American Psychological Association 
(e.g., Statement on the disclosure of test data, Statement on the Golden Rule, Resolution on a separate 
directorate for behavioral sciences at NSF). 

Author, test reviews on numerous employment, organizational and career tests for Buros Mental
Measurements Yearbook 1994, 1996, 1997, 1998, 2000, 2005, 2006, 2008.

Award for Distinguished Service Contributions, the Society of Industrial and Organizational Psychology, 
2004. 

Award for Professional Contributions and Service to Testing, Association of Test Publishers, 2014. 

Award (to staff unit) for Dissemination of Educational Measurement Concepts to the Public, Code of Fair 
Testing Practices in Education, National Council on Measurement in Education, 1989.

Award for New Product Development, Testlet Design for CLEP, Educational Testing Service, 1998. 

Board of Advisors, Center for Enrollment Research Policy and Practice, University of Southern California, 
2008- 

Council of Chief State School Officers, Technical Issues in Large Scale Assessment (TILSA), 2014 - 

Common Core State Standards - Assisted in development and policy oversight in joint effort led by CCSSO 
and National Governors Association (2009-10). 

Editorial Board, International Journal of Selection and Assessment (2001 -06), Military Psychologist (2002 - 
07), Journal of Experimental Psychology: Applied (2001 - 07), Educational Measurement: Issues and 
Practice (2010 - current), NCME Edited Book Series (2014 - current). 

Expert in judicial and regulatory proceedings involving cognitive ability testing, accommodations and score 
comparability in admissions testing, personality testing and disparate impact, job analysis and recruitment 
practices, affirmative action (Gratz v. Bollinger), and copyright infringement on the Standards for 
Educational and Psychological Testing. 

Independent Consultant (selected list), American Council on Education, Goodyear Corporation, American 
Waterways, Federal Reserve Bank of New York, City University of New York, Maryland State Departments 
of Education, Army Research Institute, American Institute for Research, US DOE, Tennessee Department of 
Education, PSI, Wonderlic Inc., employment and labor attorneys and several other organizations in areas 
of employment testing, educational evaluation, college readiness and standard setting, performance 
appraisal systems, and survey research. 

Journal Reviewer: American Psychologist; Educational Measurement: Issues and Practice; Educational 
Researcher; Personnel Psychology; Psychology, Public Policy and the Law; Journal of Occupational Health 
Psychology; Journal of Educational Measurement, Applied Educational Measurement, Journal of 
Educational Measurement, Educational Measurement: Issues and Practice, Journal of Applied Psychology, 
Human Factors, Military Psychologist, etc. 

Media experience: Appeared on national and local television and radio (CNN, Good Morning America, BBC, 
PBS) to discuss the Civil Rights Act, ADA, personality testing and admissions testing; Frequently quoted
in major newspaper stories involving testing, 1992- Present. 

Member of International Standards Organization (ISO) Working Group on International Standards on 
Psychological testing (ANSI), 2007 - 2010. 

National Academy of Sciences, Panelist and participant in workshops sponsored by the Board of Testing
and Assessment on School-to-Work (1997-98), Collegiate Admissions testing, and Accommodations and 
flagging test scores for disabled test takers (1997, 2003). 

National Council on Measurement in Education,

Chair, Professional Responsibilities Committee, 1996 - 2000. 

Chair, Career Award Committees, 2015-2016. 

Fund Development Committee, 2013-2016 

Office of Educational Research and Improvement, Technical Review Committee for grants associated with 
the National Assessment of Educational Progress, 1992-1997. 

Society for Industrial and Organizational Psychology: Member of Executive Committee, 1988-2003; Chair,
External Affairs Committee, 1993-95; Chair, Awards Committee, 1991-93; Chair, Membership Committee, 
1988-91; Membership Committee, 1987-88; Program Committee 1986-87; 1998-99; Fellowship 
Committee 2007-10; and Distinguished Service Award 2011-13. Designed membership survey and 
developed first SIOP membership directory. 

Standards for Educational and Psychological Testing, AERA, APA and NCME. Staff Director (1992-94); 

Chair, Management Committee (2005-2015). 

Standard Setting Approaches and Policy Capturing for College and Career Readiness (Consultation to 
several states) (2010-current). 

• NAEP linkages and alignment studies with SAT and Accuplacer (2011-12) and ACT, Explore, 
Compass (2014-15). 

• STAAR end-of-course examinations, Texas Education Agency (2012)

• End-of-course tests, Tennessee Department of Education (2011) 

• Achieve Inc. Algebra II examination (2008-10). 

• New York State (2012-13, through College Board contract). 

• Wyoming, Department of Education (2014, ACT contract). 

• South Carolina, Department of Education (2015, ACT contract). 

Technical or Scientific Advisory Committee Member: 

• Advisory Panel and Steering Panel, Department of Labor-OERI effort to develop assessments to 
measure competencies from the Secretary's Commission on Achieving Necessary Skills (SCANS), 
ACT, Iowa City, IA, 1992-94. 

• American Association of Medical Colleges, Blue Ribbon Technical Panel on additional measures for 
admission to medical colleges, 2012 - 2015. 

• American Diploma Project, Multi-state Algebra assessments, Research Alliance supported by 
Achieve for 16 states, 2007 - 2010. 

• American Institute of Certified Public Accountants, Psychometric Oversight Committee, 2011 - 
2014. 

• Army Research Institute for the Behavioral and Social Sciences, Chair Scientific Review Panel on 
Selection and Classification Program, 2003; Panel member of the technical advisory panel on ABLE, 
2001 - 02.

• Congressional Office of Technology Assessment. Technical assistance for study on integrity testing 
in employment settings, 1989-90, and study on performance assessments in school testing, 1991. 

• Delaware State Education Department, Chair TAC on Race to the Top 2011-2013. 

• Department of Defense, ASVAB testing program, 2000-2008, (chair 2002-2008).

• International Standards Organization, ISO Standard 10667 (organizational assessment), U.S. team 
on international development committee 2009-2012. 

• Law School Admissions Council (chair), Technical Audit Team 2009. 

• Metametrics, Technical Advisory Committee, 2013-2016. 

• National Assessment of Educational Progress (NAEP), College freshmen technical panel, 2009 - 
2010; Technical Advisory Committee on Standard Setting (Writing), 2010-2014; Advisory panel on 
survey of higher educational institutions on use of assessments for College Readiness and 
Placement, 2011-12. 

• NCAA Data and Analysis Research Group, 2005-2008. 

• Nebraska State Department of Education (TAC) 2008 - 2013. 

• PARCC - Technical Advisory Committee, 2010 - 2013. 

• Pearson Test of English, Scientific Advisory Committee, 2009 - 2013. 

• Pennsylvania State Department of Education (TAC) 2003 - current 

• Psychological Services Inc., employment-certification testing, Scientific Advisory Board, 2011 - 
current 

• Texas State Department of Education (TAC) 2011 - current 

• Technical Advisor Reporting Jointly to Texas Educational Authority and Texas Higher Education 
Coordinating Board 2008. 

• U.S. Department of Education, National Advisory Technical Panel on NCLB Reporting, 2008-11. 

• U.S. Department of Labor, Technical advisor, National Job Analysis and Skills Assessment, 1993-96. 

U.S. Congress Office of Technology Assessment: Reviewer and panelist, Making the ADA work for people 
with psychiatric disabilities in the workplace, 1993. 

Workshop presenter in areas of testing, employment selection and litigation, testing and public policy,
ADA, work-based learning, testing standards, SIOP Principles, diversity in admissions, ethics in assessment,
predictive validity, admissions testing, higher educational assessment, and research funding at regional
applied I-O meetings and conferences.

ELECTED POSITIONS: 

American Educational Research Association, Division D (Measurement and Research Methodology), Vice 
President and Council Representative, 2012-2015. 

American Psychological Association: Council of Representatives, 1997-2003 (SIOP). 

Association of Test Publishers: Board of Directors 2004-2008; Chair 2007; Treasurer 2008-2010. 

Division of Evaluation, Measurement, and Statistics, American Psychological Association: President, 2000- 
01, Member-at-Large, 1997-2000. 

Division of General Psychology, American Psychological Association: Member-at-Large, 2002-2005. 

National Council on Education in Measurement: Board of Directors, 2002-2005, 2009-2012. President 
2010-11.

Society for Industrial and Organizational Psychology: Member of Executive Committee, 1988-2003;
Council Representative, 1997-2003; Member-at-Large, 1995-97

TESTIMONY: 

California Legislature on Test validity and consequences of subgroup differences in ability testing, 1997.

Invited testimony before the National Commission on Testing and Public Policy, 1989 

Maine Joint Committee on Education and Cultural Affairs on the subject of Legislative Document No. 843 --
H.P. 1283, January 17, 2006.

Michigan Senate Education Committee, on the replacement of the MAEP and the use of admissions tests 
for accountability, April 22, 2004. 

National Advisory Commission on Work-Based Learning, 1992 - 93. 

National Assessment Governing Body, panel on testing persons with disabling conditions, October 14,
1998.

National Research Council's Committee on National Research Service Awards, May 1993. 

Nevada Legislative Hearing on College and Career Readiness, Reno, NV., May 2012. 

New York Assembly Committee, Test Disclosure, 1990. 

New York Senate Committee, Proposed legislation to regulate admissions testing, 2006. 

U.S. Congress, House Education and Labor Subcommittee, Goals 2000, 1993.

U.S. Congress, House Appropriations Subcommittee, Research Funding in Behavioral Sciences, 1992, 1993.

U.S. Congress, Senate and House Committees, Prepared testimony for APA presented on Civil Rights Act, 
Polygraph Protection Act, Integrity Testing, America 2000, Americans with Disabilities, and
Appropriations. 

U.S. Department of Education Hearings on Common Assessments for College and Career Readiness, 
November 2009. 

EXTERNAL GRANTS (PROJECT DIRECTOR): 

National Assessment Governing Board (NAGB) (2008-10). Co-Project Director. Alignment and linkage of 
Twelfth grade NAEP and the SAT. 

Southern Regional Educational Board (2000-01). Project Manager. Design and development of common 
Algebra assessment and item bank. 

Office of Educational Research and Improvement (1996-98). Project Manager. Research grant to examine 
the generalizability and utility of local models for scoring performance assessments. Working 
collaboratively with six school districts examining different models for local scoring of Pacesetter 
culminating assessments. 

Maryland Department of Education (1996-97). Project Director. Contract to design Maryland's High School 
Assessment System. Designing requirements and specifications for an end-of-course assessment system 
for high school graduation and higher education uses. Conducting public engagement with stakeholder 
groups and advising the state board. 

National Institute of Occupational Health and Safety (1992-1994). Project Director. Cooperative agreement
to develop a model interdisciplinary program to train doctoral level psychologists in occupational health 
psychology and disseminate research on preventive interventions to policymakers, psychologists and 
researchers. 

Department of Labor, (July, 1992-93). Project Director. Grant to support a review of methodologies and 
strategies in cognitive psychology and job analysis appropriate for the next revision of the "Dictionary of 
Occupational Titles." 

National Institute on Drug Abuse, (February 1992). Co-Project Director. Examination of awareness and 
knowledge of the mechanisms for receiving outside funding to support research by recent doctoral degree 
recipients in psychology. 

National Science Foundation, Principal Investigator or Co-PI on several contracts related to AP Redesign
and Instructional development. 

PRODUCT DEVELOPMENT: 

Business and Project Plan for Computerized CLEP Examinations. Award for New Product Development, 
Educational Testing Service, 1998. 

Led CB/ETS psychometric/research and redesign teams for the 2005 SAT with writing. 

Prototype of Non-cognitive assessments for college admissions. Pilot testing in 2007-08 with applicants 
across 13 colleges. 

Psychometric Research and Design of AP Potential Software. Product introduced by College Board in 2001 
for expanding access in AP Courses and Examinations based on prior accomplishments and test 
performance. 

Study Skills Inventory for high school and college freshmen. Prototype completed and product in 
development, 2005-2009. 

SELECTED BIBLIOGRAPHY: 

Camara, W.J., O'Connor, R., Mattern, K., and Hanson, M.A. (Eds.). (2015). Beyond academics: A holistic 
framework for enhancing education and workplace success. ACT Research Report 2015 (4). Iowa City, IA: 
ACT. Retrieved from http://www.act.org/research/researchers/reports/pdf/ACT RR2015-4.pdf 

Mattern, K., Burrus, J., Camara, W.J., O'Connor, R., Hanson, M.A., Gambrell, J., Casillas, A., and Bobek, B.
(2014). Broadening the definition of college and career readiness: A holistic approach. ACT Research
Report 2014 (6). Iowa City, IA: ACT. Retrieved from
http://www.act.org/research/researchers/reports/pdf/ACT RR2014-5.pdf

[R] Camara, W.J. (2014). Issues facing testing organizations in using the Standards for Educational and
Psychological Testing. Educational Measurement: Issues and Practice, 33 (4) 13-15. 

[R] Camara, W.J. (2013). Defining and measuring college and career readiness: A validation framework for 
new State consortium assessments. Educational Measurement: Issues and Practice, 32 (4) 16-27. 

[R] Camara, W.J., Packman, S. and Wiley A. (2013). College, graduate and professional school admissions
testing. In K. Geisinger (Ed.), Handbook of Testing and Assessment in Psychology (pp. 297-318).
Washington, D.C: American Psychological Association.

[R] Camara, W.J. and Shaw, E. (2012). Tests, score reports, research and getting along with the media. 
Educational Measurement: Issues and Practice. 

[R] Harris, W.G., Jones, J.W., Klion, R., Arnold, D.W., Camara, W.J., and Cunningham, M. R. (2012). Test
publisher's perspective on "An updated meta-analysis" of integrity testing. Journal of Applied Psychology,
97 (3), 531-536.

[R] Mattern, K., Kobrin, J., and Camara, W.J. (2012). Promoting Rigorous Validation Practice: An Applied
Perspective. Measurement: Interdisciplinary Research and Practice, 10, pp 88-92. 

Camara, W.J. and Quenemoen, R. (2012). Defining and Measuring College and Career Readiness and 
Informing the Development of Performance Level Descriptors (PLDs). Commissioned white paper for 
PARCC. Available at http://www.parcconline.org/sites/parcc/files/PARCC%20CCR%20paper%20vl4%201- 
8-12.pdf 

Wyatt, J., Wiley, A., Camara, W.J., and Proestler, N. (2011). The development of an index of academic rigor 
for college readiness. College Board Research Report (2011-11). New York, NY: College Board. Available at 
http://professionals.collegeboard.com/profdownload/pdf/RR2011-11.pdf

Luecht, R. and Camara, W.J. (2011). Evidence and design implications required to support comparability
claims. Commissioned white paper for PARCC. Available at
http://parcconline.org/sites/parcc/files/PARCC WhitePaper-RLuechtWCamara%5B5%5D.pdf

Wyatt, J., Kobrin, J., Wiley, A., Camara, W.J., and Proestler, N. (2011). SAT Benchmarks: Development of a 
college readiness benchmark and its relationship to school performance. College Board Research Report 
(2011-5). New York, NY: College Board. Available at 
http://professionals.collegeboard.com/profdownload/pdf/RR2011-5.pdf 

Wiley, A., Wyatt, J., and Camara, W.J. (2010). Development of a multidimensional index of college 
readiness. College Board Research Report (2010-03). New York, NY: College Board. Available at 
http://professionals.collegeboard.com/profdownload/pdf/10b 3110 CollegeReadiness RR WEB 110315.pdf

[R] Packman, S.J., Camara, W.J., and Huff, K. (2010). A snapshot of industry and academic compensation in
educational measurement and assessment. Educational Measurement: Issues and Practice, Fall. 

[R] Camara, W. J. (2009). Validity Evidence in Accommodations for English Language Learners and Students with
Disabilities. Journal of Applied Testing Technology, 10 (2). http://www.testpublishers.org/jattmain.htm 

Mattern, K., Kobrin, J., Patterson, B., Shaw, K. and Camara, W.J. (2009). Validity is in the eye of the beholder:
Conveying SAT research findings to the general public. In R. Lissitz (ed.) The Concept of Validity: Revisions, New 
Directions and Applications. Charlotte, NC: Information Age Publishing 

Camara, W.J. (2009). College Admission Testing: Myths and Realities in an Age of Admissions Hype. In R. 
Phelps (Ed.), Correcting fallacies about educational and psychological testing (pp. 147-180). Washington,
DC: American Psychological Association.

[R] Camara, W. J. and Lane, S. (2006). A historical perspective and current views on the Standards for
Educational and Psychological Testing. Educational Measurement: Issues and Practice, 25, pp. 35-41. 

Camara, W.J. (2006). Improving Test Development, Use, and Research: Psychologists in Educational and 
Psychological Testing Organizations. In R. Sternberg (Ed.), Careers in Psychology. Washington, DC:
American Psychological Association.

[R] Phillips, S. & Camara, W. J. (2006). Legal and ethical issues in testing. In R. Brennan (Ed.), Educational
Measurement (Volume IV) (pp. 733-755). AERA and American Council on Education. 

Camara, W. J. & Kimmel, E. (Eds.) (2005). New tools for admissions to higher education. Mahwah, NJ: 
Erlbaum. 

Camara, W. J. (2005). Broadening criteria of college success and the impact of cognitive predictors in 
admissions testing (pp. 81-107), In W.J. Camara & E. Kimmel (Eds.), New tools for admissions to higher 
education. Mahwah, NJ: Erlbaum. 

Camara, W. J. (2005). Broadening predictors of college success (pp. 53-80). In W.J. Camara & E. Kimmel 
(Eds.), New tools for admissions to higher education. Mahwah, NJ: Erlbaum. 

Kobrin, J., Camara, W.J., & Milewski, G. (2004). The utility of the SAT I and SAT II for admissions. In R.
Zwick (Ed.), Rethinking the SAT: The future of standardized testing in university admissions (pp. 251-276).
New York: Routledge Falmer. 

Schmidt, A.E., & Camara, W.J. (2004). Group differences in standardized test scores and other educational 
indicators. In R. Zwick (Ed.), Rethinking the SAT: The future of standardized testing in university admissions 
(pp. 189-202). New York: Routledge Falmer. 

[R] Cahalan, C., Mandinach, E. & Camara, W.J. (2003). The impact of flagging on the admissions process.
Journal of College Admissions. No. 186. 

Camara, W.J. (2003). What educators need to know about professional testing standards. In J. Wall & G. 
Walz (Eds.), Measuring up: Resources on testing for teachers, counselors, and administrators. Greensboro, 
NC: ERIC/CASS. 

Noble, J. and Camara, W. (2003). Issues in college admissions testing. In J. Wall & G. Walz (Eds.),
Measuring up: Resources on testing for teachers, counselors, and administrators. Greensboro, NC:
ERIC/CASS. 

Camara, W.J., Kimmel, E., Scheuneman, J., and Sawtell, E. (2003). Whose grades are inflated? College
Board Research Report (2003-4). New York: College Board. Available at 
http://professionals.collegeboard.com/profdownload/pdf/04843cbreport20034 31757.pdf 

Camara, W.J. (2003). Construct validity. In R. F. Ballesteros (Ed.) Encyclopedia of Psychological Assessment 
(Volume 2) (pp. 1070-1075). London: Sage Publications. 

Camara, W.J. (2002). Advances in scoring and inferences concerning examinee behavior in computer-
adaptive testing. In Mills, C., Potenza, M, and Ward, W. (Eds.) Computer-Adaptive Testing: Building a 
foundation for future assessments. Mahwah, NJ: Lawrence Erlbaum Associates, Inc. 



Linn, R.L., Drasgow, F., Camara, W., Crocker, L., Hambleton, R.K., Plake, B.S., Stout, W. and van der Linden, 
W.J. (2002). Examinee behavior and scoring computer-based tests. In C. Mills, M. Potenza, J. Fremer and 
W. Ward (Eds.) Computer-based testing: Building the foundation for future assessment. Mahwah, NJ: 
Lawrence Erlbaum Associates, Inc. 

Noble, J., Camara, W., and Fremer, J. (2002). Admissions testing and students with disabilities. In Ekstrom, 
R., and Smith, D. (Eds.) Assessing individuals with disabilities (pp. 173-190). Washington, DC: American 
Psychological Association. 

Mandinach, E.B., Cahalan, C. and Camara, W.J. (2002). The impact of flagging on the admissions process: 
Policies, practices and implications. College Board Research Report (No. 02-02). New York: College Board. 
Available at http://www.ets.org/Media/Research/pdf/RR-05-20.pdf 

Cahalan, C., Mandinach, E.B., and Camara, W.J. (2002). Predictive validity of SAT I: Reasoning test for test
takers with learning disabilities and extended time accommodations. ETS Research Report (No. 02-03). 
Princeton, NJ: ETS. Available at http://www.ets.org/Media/Research/pdf/RR-02-03-Mandinach.pdf 

[R] Scheuneman, J.D., Camara, W.J., Cascallar, A.S., Wendler, C., and Lawrence, I. (2002). Calculator access,
use and type in relation to performance on the SAT I: Reasoning test in mathematics. Applied 
Measurement in Education, 15 (1), 95-112. 

Camara, W.J. and Echternacht (2000). The SAT I and high school grades: utility in predicting success in 
college. College Board Research Note (RN-10). New York: College Board. 

Camara, W. (2000). Using class rank alternative plans for college admissions. Association of American 
Colleges and Universities: Diversity Digest, Summer, 8-10. 

[R] Camara, W.J., Dorans, N., Morgan, R. and Myford, C. (2000). Advanced Placement: Access not quality.
Educational Policy Analysis Archives. 8(40). Online journal, available: 
http://epaa.asu/edu/epaa/v8n40.html 

[R] Camara, W.J. and Merenda, P.M. (2000). Using personality tests in preemployment screening: Issues
raised in "Soroka v. Dayton-Hudson Corporation." Psychology, Public Policy and the Law, 6, (4), 1-23. 

[R] Camara, W., Puente, A., and Nathan, J. (2000). Psychological test usage: Implications in professional
psychology. Professional Psychology: Research and Practice, 31 (2), 141-154. 

Camara, W. and Schmidt, A. (1999). Group differences in standardized testing and social stratification. 
College Board Research Report (No. 99-5). New York: College Board. 

[R] Schneider, D.L., Camara, W.J., Tetrick, L. and Sternberg, C. (1999). Training in occupational health
psychology: Initial efforts and alternative models. Professional Psychology: Research and Practice, 30 (2), 
138-142. 

Nathan, J., and Camara, W.J. (1999). Concordance of the examinee performance on the SAT and ACT.
College Board Research Note (RN99-7). New York: College Board.

Powers, D. and Camara, W. (1999). Coaching and the SAT I. College Board Research Summary (RN99-6).
New York: College Board.

Camara, W.J. and Millsap, R. (1998). Using the PSAT/NMSQT and course grades in predicting success in
Advanced Placement. College Board Research Report (RR98-5). New York: College Board. 

Camara, W.J., Copeland, T. and Rothschild, B. (1998). Effects of extended time on the SAT I: Reasoning 
Test score growth for students with learning disabilities. College Board Research Report (No. 98-7). New 
York: College Board. 

Nathan, J., and Camara, W.J. (1998). Score change when retaking the SAT I Reasoning Test. College Board 
Research Note (RN98-5). New York: College Board. 

Camara, W. (1998). High school grading policies. College Board Research Note (RN98-4). New York: College 
Board. 

Smith, R. and Camara, W. (1998). Block schedules and student performance on AP examinations. College 
Board Research Note (RN98-3). New York: College Board. 

Camara, W., Kimmel, E., et al. (1997). Design of a high school assessment system. (Vols. I and II).
Technical Report of the College Board and ETS. Baltimore, MD. Maryland State Department of Education.

[R] Camara, W. J. (1997). Educational assessment: Responsible uses and professional dilemmas. European
Journal of Psychological Assessment, 13 (2), 140-152. 

Camara, W.J. and Kraiger, K. (1997). Organisational infrastructure for selection and assessment in the USA. 
In Smith, M & Sutherland, V. (Eds.). Professional issues in selection and assessment (pp. 139-146). Wiley: 
London. 

Camara, W. (1997). Validity, Fairness, and Public Policy of Employment Testing:
Influences of the American Psychological Association (pp. 3-11). In Barrett, R., Fair employment
strategies.

Camara, W., Kimmel, E. and colleagues (1997). Models for the Design of Maryland's High School 
Assessments. College Board Technical Report (CBTR97-1). 

[R] Camara, W.J. and Schneider, D. (1995). Questions of construct breadth and openness of research on
integrity tests. American Psychologist, AT (3). 

[R] Camara, W. J. and Brown, D. (1995). Educational and employment testing: Changing concepts in
measurement and policy. Educational Measurement: Issues and Practice, 14 (1), 5-12. 

Camara, W. J. (1995). APA involvement in employment testing policy and litigation: An historical overview. 
Unpublished manuscript. 

[R] Camara, W.J. and Schneider, D. (1994). What we know and still don't know about integrity tests.
American Psychologist, 47 (3). 

Camara, W.J., and Baum, C. (1993). Developing careers in research: Knowledge, attitudes and intentions of 
recent doctoral recipients in psychology. (Final report 92MF04400101D) Rockville, MD, National Institute 
of Drug Abuse. 

Camara, W.J. (1992). Fairness and "fair-use" in employment testing: A matter of perspectives (pp. 215-
233). In Geisinger, K, Testing of Hispanics. Washington, DC: APA 

[R] Camara, W.J. (1991). A national exam: Has its time come? Child Behavior & Development, 7 (9-10).

[R] Camara, W.J., et al. (1990). Enhancing psychological science: A report by the Science Advisory
Committee. American Psychologist, 45 (7). 

[R] Fremer, J., Diamond, E., and Camara, W. (1989). Developing a "Code of Fair Testing in Education."
American Psychologist, 44 (7), 1062-1067. 

[R] Bond, L., Camara, W.J., and VandenBos, G.R. (1989). Psychological test standards and clinical practice.
Hospital and Community Psychiatry, 40 (7), 687-693. 

Camara, W.J. (1989). Detecting dishonest employees: What is the state of the art? Proceedings of the 
Second Annual National Assessment Conference, (pp.26-28) University of Minnesota and Personnel 
Decisions Inc., Minneapolis, MN. 

Camara, W.J., Kuhn, D., and Ziemak, J. (1987). Development and training of Army civilian first-line 
supervisors. (Final Report FR-87-36). Alexandria, VA. Human Resources Research Organization. 

[R] Waters, B.K., Laurence, J.H., & Camara, W.J. (1987). Personnel enlistment and classification procedures
in the U.S. Military. Washington, D.C.: National Academy of Science Press. 

Camara, W.J., & Laurence, J.H. (1987). Military classification and high aptitude recruits (Technical Report 
TR-PRD-87-16). Alexandria, VA. Human Resources Research Organization. 

Camara, W.J. (1986). The effects of job previews on self-selection decisions. Dissertation Abstracts 
International, 47, DA8623268. 

Camara, W.J. (1984). Assessment centers: A critical review of the literature. Unpublished Paper, 
Champaign-Urbana: University of Illinois. 

Camara, W.J. (1983). Personnel selection: A classification and review of techniques. Unpublished Masters 
Thesis, Champaign-Urbana: University of Illinois. 

Camara, W.J. (1981) Infusion - inservice: Career awareness. A Massachusetts guide: Promising practices 
in career education. Boston, MA: Department of Education. 

SELECTED PRESENTATIONS: 

Camara, W.J. (2015). Employing empirical data in judgmental processes. Paper presented at the National 
Conference on Student Assessment, San Diego, CA. 

Camara, W.J. and Westrick, P. (2015). Admissions testing in the United States. Invited presentation at the 
Annual Meeting of the American Educational Research Association in Chicago, IL. 

Camara, W.J. (2015). "Evidentiary basis related to claims concerning college and career readiness." 
Colloquium, University of Massachusetts, Amherst, Graduate Program in Education.

Camara, W.J. (2015). Overview of the 2014 Revision of the 'Standards for Educational and Psychological
Testing.' Paper presented at the Association of Test Publishers, Palm Springs, CA. 

Camara, W.J. (2014). Test security: Prevention-detection-investigation. Workshop presentation for the 
Minnesota State Department of Education, Offices of Assessment and Accountability. 

Camara, W.J. (2014). How has our approach to test security evolved and where are we headed. Paper 
presented at the Conference on Test Security, Iowa City, IA. 

Camara, W.J. (2014). Developing sources of validation evidence across assessment settings. Invited 
presentation at the International Testing Commission, San Sebastian, Spain. 

Camara, W.J. and Shaw, D. (2014). Use of comment codes during performance scoring to provide 
formative feedback. Paper presented at the National Conference on Student Assessment, New Orleans, LA.

Camara, W.J. (2014). Employing empirical data in judgmental standard setting processes. Paper presented 
at the Annual Meeting of the Society for Industrial and Organizational Psychology, Honolulu, HI. 

Camara, W.J. (2014). Fisher v. University of Texas: The future of affirmative action. Participant in panel at 
the Annual Meeting of the Society for Industrial and Organizational Psychology, Honolulu, HI. 

Camara, W.J. (2014). AERA Vice-Presidential Symposia: Technology Enhanced Items in Large Scale 
Assessments. Annual Meeting of the American Educational Research Association, Philadelphia, PA. 

Camara, W.J. (2013). PISA's use for international benchmarking and comparisons of post-secondary 
readiness. Invited panel, Oxford University. 

Camara, W.J. (2013). College and career readiness: Criterion-related outcomes. Invited address at the 
Maryland Assessment Research Center for Educational Success, University of Maryland at College Park. 

Camara, W.J. (2013). Implications of consortia assessments for Higher Education. Paper presented at the 
National Conference on Student Assessment at the National Harbor, MD. 

Camara, W.J. (2013). Innovations in psychometrics and assessment. Developing college and career 
readiness assessments. Workshop at the National Center for Measurement in Education, San Francisco,
CA.

Camara, W.J. (2012). Admissions practices and college going in the U.S. Invited presenter at the 
International Conference on Assessment and Evaluation, Riyadh, Saudi Arabia. National Center on 
Assessment in Higher Education. 

Reshetar, R., and Camara, W. J. (2012). Redesigning the Advanced Placement Science Assessments 
application of evidence centered design. Invited Panelist at the National Research Council Workshop. 

Camara, W.J. (2012). College and career readiness: Establishing validation evidence to support the use of 
new assessments. Invited lecture at the Pearson Center for Applied Psychometric Research, University of 
Texas at Austin. 

Camara, W.J. (2012). Defining and measuring college and career readiness: Developing Performance level
descriptors and defining criteria. Paper presented at the Annual Meeting of the National Council on 
Measurement in Education, Vancouver, Canada. 

Camara, W.J. (2012). Invited panel presentation on data integrity and cheating. National Center on 
Educational Statistics. Sponsored Symposium on Testing Integrity, Washington, D.C. 

Camara, W.J. (2011). College and career readiness: An initial validation argument. Paper presented at the 
National Conference on Student Assessment, Orlando, FL. 

Camara, W.J. (2011). Developing and expanding state K-20 longitudinal data systems: Common core state 
standards and consortia assessments. Paper presented at the National Conference on Student 
Assessment, Orlando, FL. 

Camara, W.J. (2011). The revised testing standards: Potential impact and consequences for assessments in 
employment and business settings. Invited address at the International Personnel Assessment Council, 
Washington, DC. 

Camara, W.J. (2011). College and career readiness standards and assessments: An initial validation 
argument. Paper presented at the CCSSO National Conference on Student Assessment, Orlando, FL. 

Camara, W. J. (2011). Empirical benchmarks in a judgmental standard setting process. Paper presented at 
the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. 

Camara, W. J. (2011). Uncovering Educational Measurement & Assessment Professionals: Demographics, 
Education, Experience and Engagement. Presidential Address at the Annual Meeting of the National 
Council on Measurement in Education, New Orleans, LA. 

Camara, W.J. (2011). Formative assessment: Implications of the common core on classroom assessment. 
Invited address at the Annual Meeting of the American Educational Research Association, Classroom 
Assessment SIG. 

Camara, W. J., Wiley, A., Wyatt, J., and Kobrin. J. (2011). College readiness benchmarks. Paper presented 
at the Annual Meeting of the National Council on Measurement in Education, New Orleans, LA. 

Camara, W.J. (2010). Validating claims and evidence related to student college and career readiness: 
Lessons learned from higher education. Invited presentation at the Annual CCSSO Policy Meeting,
Louisville, KY.

Camara, W.J. (2010). Developing benchmarks for college and career readiness. Paper presented at the 
Annual Meeting of the National Council on Measurement in Education, Denver, CO. 

Camara, W.J. (2010). Multidimensional models of college readiness. Paper presented at the Large Scale 
Assessment Conference, Detroit, MI.

Camara, W. J. (2010). Progress in revising the Standards for Educational and Psychological Testing, the 
Annual Conference of the American Psychological Association, San Diego, CA. 

Camara, W.J. (2009). Operational Issues in Developing National Admissions Testing and College Credit 
Testing Programs in the U.S. Invited Colloquium at the University of Aachen, Germany. 

Camara, W.J. (2009). Common Core Standards and Coordinated State Assessment. Invited Symposium at
the Annual Meeting of the American Educational Research Council, Denver, CO. 

Camara, W.J. (2009). You can get there from here: Innovation in Educational assessment and linking 
accountability tests. Invited address at the National Conference of State Legislators, Washington, DC. 

Camara, W. J. (2009). Noncognitive assessments in college admissions. Paper presented at the Annual 
Conference of the American Psychological Association, Toronto, Canada. 

Camara, W., Kobrin, J., Mattern, K., Patterson, B., and Shaw, E. (2008). The Long and Winding Road: 
Researching the Validity of the SAT. Invited paper at the 9th annual conference of the Maryland
Assessment Research Center for Education Success (MARCES), College Park, MD. 

Camara, W. J. (2008). Innovations in assessment. Presenter at the Invitational Conference Educational 
Testing in America: State Assessment, Achievement Gaps, Federal Policy and Innovations, Sponsored by 
ETS and the College Board, Washington, DC. 

Camara, W. J. (2008). College readiness vs college admissions: Will we ever resolve the chasm between the 
K-12 and Higher Education? Invited address at Invitational Conference on Defining Enrollment in the 21st
Century, sponsored by the University of Southern California's Center for Enrollment Research, Policy and 
Practice. 

Camara, W. J. (2008). Diversity in admissions. Invited Address at the ETS Conference of Institutional 
Researchers, Measuring Success and Making Assessment Data Work at Your Institution, Princeton, NJ. 

Camara, W. J. (2008). The educational measurement profession: state of our art. Presentation at the 
annual meeting of the National Council on Measurement in Education, New York, NY. 

Camara, W. J. (2007). Protecting test takers. Invited Presidential Symposium at the Annual Conference of 
the American Psychological Association, San Francisco, CA. 

Camara, W. J. (2007). Revising the standards for educational assessment. Invited Symposium at the 
Annual Meeting of the American Educational Research Association, Chicago, IL. 

Camara, W. J. (2006). Using norm referenced tests for accountability under NCLB. Presenter at the Annual 
Meeting of the National Association of Collegiate Admissions Counselors, Pittsburgh, PA. 

Camara, W. and Schmidt, A. (2006). University Admissions Practices in the US and the Role of Admissions 
Tests. Invited Address at UCAS Conference, Nottingham, United Kingdom. 

Camara, W. J. (2005). Constraints in current admissions practices: Impacts on diversity and definition of
college success. Invited Address at the Goldman-Sachs Foundation and ETS Symposium on Addressing 
Achievement Gaps, Princeton, NJ. 

Camara, W. J. (2005). Update on the new SAT. Annual Meeting of the National Association of Collegiate 
Admissions Counselors, Tampa, FL. 

Camara, W. J. (2005). Design and development of the SAT Writing Test. National Council on Measurement 
in Education, Montreal, Canada. 

Camara, W.J., Kobrin, J., and Sathy, J. (2005). Is there an SES advantage on the SAT and college
performance? National Council on Measurement in Education, Montreal, Canada. 

Camara, W. (2004). The Use of Qualitative and Quantitative Data in Admissions. Annual Meeting of the 
National Association of Collegiate Admissions Counselors, Milwaukee, WI.

Laitusus, V., Camara, W. J. and Wang, B. (2004). An examination of differential item functioning for
language minorities on a verbal and math reasoning test. National Council on Measurement in Education, 
San Diego, CA. 

Camara, W.J. (2004). New predictors in college admissions. Annual Meeting of the Society of Industrial 
and Organizational Psychologists, Chicago, IL. 

Camara, W. J. (2003) Current tests and future designs in admissions testing: The new SAT. CASMA-ACT 
Invitational Conference. Iowa City, IA. 

Camara, W. J. (2003). Validity and utility of admissions tests. Invited symposium, American Council on 
Education, Washington, DC. 

Camara, W.J. (2003). Changes to the SAT: National Association of Collegiate Admissions Counselors. 

Camara, W. (2003). Making test results more useful and understandable: Advances in diagnostic score 
reporting. Paper presented at the Annual Meeting of the National Council on Measurement in Education, 
Chicago, IL. 

Camara, W. (2002). Revision of the Principles for the Validation and Use of Personnel Selection 
Procedures, Workshop conducted at the Mid-Atlantic Personnel Assessment Consortium, New York, NY. 

Camara, W. (2002). Predicting success in employment and education: Uses and limitations of tests and 
other factors. Invited address, New York Academy of Sciences. 

Camara, W. (2002). Prediction and testing. Invited address at the CRESST Conference on Assessment, 
Accountability and Improvement, Los Angeles, CA. 

Camara, W. (2002). Admissions tests: Use and value in higher education. Invited address to the Association
of American Universities, Meeting of Presidents and Chancellors, Atlanta, GA. 

Camara, W. (2002). The future of admissions testing. Annual Meeting of the National Association of 
Collegiate Admissions Counselors, Salt Lake City. 

Camara, W. (2002). Fairness in employment testing. Paper presented at the Annual Convention of the 
American Psychological Association, Chicago, IL. 

Camara, W. (2002). Testing and admissions in higher education. Invited presentation at the Annual 
Meeting of the American Association for the Advancement of Science, Boston, MA. 

Camara, W. (2001). The utility of the SAT I and SAT II for admission at the University of California and the 
nation. Paper presented at the Invitational Conference on Rethinking the SAT in university admissions, 
University of California at Santa Barbara.

Camara, W. (2001). Test preparation on the SAT: Impact on validity. Paper presented at the Annual
Meeting of the National Council on Measurement in Education, Seattle, WA. 

Camara, W. (2001). Utility of the SAT in college admissions. Colloquium at the University of California at 
Davis. 

Camara, W. (2001). Do accommodations improve or hinder psychometric qualities of assessment? 
Presidential address for Division 5 at the Annual Convention of the American Psychological Association,
San Francisco, CA.

Camara, W. (2000). Future of educational assessment. Paper presented at the Annual Convention of the 
American Psychological Association, Washington, D.C. 

Camara, W. (2000). Implications of the revised testing standards for personnel selection. Invited Address 
at the Annual Conference of the International Personnel Management Assessment Council, Washington. 

Camara, W. (2000). Performance of test takers with LD or ADD on the SAT and subsequent college 
behavior. Paper presented at the Annual Meeting of the American Educational Research Association, New 
Orleans, LA. 

Camara, W. (2000). Implications of the testing standards in personnel assessment. Presentation at the 
Annual Meeting of the Society for Industrial and Organizational Psychology, New Orleans, LA. 

Camara, W. (1999). The revised 'Standards for Educational and Psychological Testing'. Workshop at the 
Mid-Atlantic Personnel Assessment Consortium, New York, NY. 

Camara, W. (1999). Retesting on the SAT under standard and non-standard administrations. Presentation 
at the National Conference of Measurement in Education, Montreal, Canada. 

Camara, W. (1999). Testing practices in clinical assessment. Paper presented at the American Psychological 
Association, Boston, MA. 

Camara, W. (1998). Accommodations for persons with disabilities: results of attempts to establish 
comparability in cognitive testing. Continuing education workshop at Personnel Testing Council of 
Metropolitan Washington. 

Camara, W. (1998). Alternatives to item pattern scoring and use of response-time estimation in computer 
adaptive testing: Invited Presentation. ETS Invitation Conference on future assessments, Philadelphia, PA. 

Camara, W. (1998). Future trends in assessment. Presentation at the Annual Conference for the Society for 
Industrial and Organizational Psychology, Dallas, TX.

Camara, W. (1998). Selection into I-O programs: Focus on GRE validity. Symposium at the Annual
Conference for the Society for Industrial and Organizational Psychology, Dallas, TX.

Camara, W. (1998). Rights and responsibilities of test takers. Presentation at the Annual Meeting of the 
National Council on Measurement in Education, San Diego, CA.

Scheuneman, J. and Camara, W. (1998). Analysis of mathematics achievement in Pacesetter program.
Presentation at the Annual Meeting of the American Educational Research Association, San Diego, CA. 

Camara, W. (1998). Evaluating math curricular reform efforts. Presentation at the Annual Meeting of the 
American Educational Research Association, San Diego, CA. 

Camara, W. (1998). Psychometric and operational constraints remaining in CBT. Colloquium at Fordham 
University Graduate Departments of Psychology and Education, New York, NY. 

Camara, W. (1997). State and district accountability: Uses and misuses of assessments. Presentation at 
the Large Scale Assessment Conference of the Council of Chief State School Superintendents, Colorado 
Springs, CO. 

Camara, W. (1997). Effects of calculator use on performance on a mathematics admissions test. Panel
discussant at the Annual Meeting of the American Educational Research Association, Chicago, IL. 

Camara, W. (1997). Assessing workplace skills: Public policy and technical considerations. Colloquium at 
Baruch College, City University of New York. 

Camara, W. (1996). Effects of extended time on the performance of students with disabilities. Paper 
presented at the Annual Conference of the National Association of College Admissions Counselors, 
Minneapolis, MN. 

Camara, W. (1996). Flagging test scores for students with disabilities: Understanding and using test scores 
for admissions and placement decisions. College Board National Forum, New York, NY.

Camara, W. (1996). Adapting/Translating educational and psychological tests: Issues, technical advances,
and guidelines. Panel discussion at the Annual Convention of the American Psychological Association, 
Toronto, Canada. 

Camara, W. (1996). Doctoral training in organizations: Comparisons among Business schools and 
Psychology departments. Panel discussion at the Annual Meeting of the Society for Industrial and 
Organizational Psychology, San Diego, CA. 

Camara, W. (1996). SCANS based competencies: Discussion of the national job analysis project. Discussant 
at the Large Scale Assessment Conference of the Council of Chief State School Superintendents, Phoenix, 
AZ. 

Camara, W. (1995). Test speededness and other implications of testing persons with disabilities in large- 
scale programs. Paper presented at the October Meeting of the Personnel Testing Council, Washington, 
DC., and (1996) Mid-Atlantic Personnel Assessment Consortium, Potomac, MD. 

Camara, W. (1995). Flagging of test scores: Policies, Data and Opportunities. Panel discussion, ETS 
Committee for People with Disabilities, Educational Testing Service, Princeton, NJ. 

Camara, W. (1995). Lessons for test developers from the NACAC Commission on Standardized Testing. 
Paper presented at the Annual Conference of the National Association of College Admissions Counselors, 
Boston, MA.

Camara, W. (1995). Standard setting: A mixed bag of judgment, psychometrics and policy. Paper
presented at the Annual Convention of the American Psychological Association, New 
York, NY. 

Camara, W. (1995). Federal funding opportunities in Industrial and Organizational Psychology (moderator 
and presenter). Annual Conference of the Society for Industrial and Organizational Psychology, Orlando,
FL.

Camara, W. (1995). The New SAT: Reactions (chair) Symposium at the Annual Meeting of the National 
Council on Measurement in Education, San Francisco, CA. 

Camara, W. (1994). International perspectives on test use: Options in education and enforcement? 
Presentation at the 23rd International Congress of Applied Psychology, Madrid, Spain.

Camara, W. (1994). Developments in creating the new national database of occupational titles (chair and
presenter). Society for Industrial and Organizational Psychology, Nashville. 

Camara, W. (1994). The impact of national testing standards on personnel assessment. Invited 
presentation at the International Personnel Management Association Assessment Council Meeting, 
Charleston, SC. 

Camara, W. (1994). Test standards: Balancing technical, applied and policy issues. Invited presentation, 
Personnel Testing Council of Washington, DC. 

Camara, W. (1993). Who should control access and use of Neuropsychological Tests? Invited presentation 
at the National Academy of Neuropsychology, Phoenix, AZ. 

Camara, W. (1993). Implications of the "Americans with Disabilities Act" on Assessment, Invited address at 
International Personnel and Management Association, Sacramento, CA. 

Camara, W. (1993). Ethical issues in research, teaching, and publication for industrial psychologists (chair, 
panelist). Society for Industrial and Organizational Psychology, San Francisco, CA. 

Camara, W. (1993). I/O Psychology in the Public-Policy-Making Process (panelist). Society for Industrial 
and Organizational Psychology, San Francisco, CA. 

Lipsitt, L. & Camara, W.J. (1993). The childhood origins of creativity. Paper presented at the Nebraska 
symposium on gifted children, Lawrence, Kansas. 

Camara, W. (1992). 100 Years of Psychological Testing (chair). Centennial Convention of the American 
Psychological Association, Washington, DC. 

Camara, W. (1992). Occupational Health Psychology: A new specialty for psychology and training needs. 
Centennial Convention of the American Psychological Association, Washington, DC. 

Camara, W. (1992). Correlates between personnel and educational assessment in national policy. Invited 
Address at the Annual Convention of the American Psychological Society, San Diego, CA. 

Camara, W.J. (1992). Affirmative Action and the Civil Rights Act of 1991. Invited address at International 
Personnel and Management Association, Baltimore, MD.

Camara, W.J. (1992). Americans with Disabilities Act and the Civil Rights Act of 1991: Implications for
industrial psychologists. Annual Conference of the Society of Industrial and Organizational Psychology, 
Montreal, Quebec. 

Camara, W.J. (1992). Scans and America 2000. Annual Conference of the Society of Industrial and 
Organizational Psychology, Montreal, Quebec. 

Camara, W.J. (1991). Federal funding opportunities at ADAMHA, the Department of Energy and the 
Department of Agriculture (moderator). Thirty-sixth Institute on Federal Funding, National Graduate
University, Washington, D.C. 

Camara, W.J. (1990). Disclosure of test scores, items and protocols in educational settings. Paper 
presented at the 98th Annual Convention of the American Psychological Association, Boston, MA. 

Camara, W.J. (1991). Integrity testing: Risks and rewards. Invited address, Personnel Testing Council of 
Washington, DC. 

Camara, W.J. (1989). Detecting dishonest employees: What is the state of the art? Paper presented at the 
Annual National Assessment Conference of the University of Minnesota and Personnel Decisions Inc., 
Minneapolis, MN. 

Camara, W.J. (1989). Predicting Honesty: Scientific Evidence, Business Necessity, and Social Policy Issues.
Paper presented at the 97th Annual Convention of the American Psychological Association, New Orleans,
LA.

Camara, W.J. (1989). Legal burden in employment selection: Recent court decisions. Symposium at the 
4th Annual Convention of the Society of Industrial/Organizational Psychologists, Boston, MA. 

Camara, W.J. and Kuhn, D. (1988). Development of a mixed standard rating scale for training and 
development. Paper presented in Division 14 of the 96th Annual Convention of the American Psychological 
Association, Atlanta, GA. 

Camara, W.J. (1987). The utility of a job-person match for personnel selection decisions. Paper presented 
in Division 14 of the 95th Annual Convention of the American Psychological Association, New York, NY. 
(ERIC Document Reproduction Service No. ED 289 149). 

Ziemak, J.P., Camara, W.J., Fisher, G.P., and Darmsteadt, G.H. (1987). Development of effective army 
civilian first-line supervisors. Paper presented at the 29th Annual Military Testing Association Conference, 
Quebec, Canada. 

Camara, W.J., Colot, P., Hutchinson, G., & Campbell, B. (1987). The reality of data collection. Paper
presented at the 95th Annual Convention of the American Psychological Association, New York, NY. (ERIC 
Document Reproduction Service No. 290 775). 

Camara, W.J. (1986). The utility of biodata in predicting military performance. Paper presented at the 
26th Annual Military Testing Association Conference, Mystic, CT.

Camara, W.J. (1986). Effects of job previews on personnel selection. Paper presented at the National
Conference of the Association of Human Resources Management and Organizational Behavior, New 
Orleans, LA. 

Camara, W.J. (1986). Equivalence of rater sources on job analysis ratings. Paper presented in division 14 
at the 94th Annual Convention of the American Psychological Association, Washington, D.C. (ERIC 
Document Reproduction Service No. ED 281 455).

Camara, W.J. and Means, B. (1986). Status of low-aptitude accessions following military service. Paper 
presented at the 94th Annual Convention of the American Psychological Association, Washington, D.C. 

Additional presentations at national and regional meetings and university colloquium not normally cited. 
9/2015

From: John S. Neikirk

To: 'Wayne Camara' 

CC: Suzanne Lane (sl+@pitt.edu); dfrisbie@uiowa.edu; Felice Levine 

Sent: 2/14/2014 9:40:11 PM 

Subject: RE: Existing Standards 


Hi Wayne, thanks for your message and letting me know that this is possible. We are in communication with AERA's
legal counsel, and perhaps also some outside help as well. We will be in touch as soon as we learn more. 

Best wishes, John 

John Neikirk 

Director of Publications 

American Educational Research Association 

1430 K Street, NW, Suite 1200 

Washington, DC 20005 

202.238.3238 

jneikirk@aera.net 


From: Wayne Camara [mailto:Wayne.Camara@act.org] 

Sent: Friday, February 07, 2014 8:26 AM 
To: John S. Neikirk 

Cc: Suzanne Lane (sl+@pitt.edu); dfrisbie@uiowa.edu
Subject: RE: Existing Standards 

John - the management committee has funds to cover any legal work that might be needed to get this issue escalated. 
So feel free to let us know if you want to bring in some legal counsel to pursue this. 


From: John S. Neikirk [mailto:JNeikirk@aera.net]

Sent: Wednesday, February 05, 2014 10:35 AM

To: Felice Levine; Wayne Camara; Ernesto, Marianne; 'lwise@humrro.org'; SL@pitt.edu; Frisbie, David A; Gerald Sroufe

Cc: Barbara Plake 

Subject: RE: Existing Standards 

Hello everyone, I am following up on the messages from December about the pdf posting of Standards at:
https://law.resource.org/pub/us/cfr/ibr/001/aera.standards.1999.pdf

I had thought this was taken down after I wrote to the organization in December, asking that the pdf be taken down 
immediately. I checked yesterday and the pdf is still available. I wrote back to the organization, and I received a reply with the
attached pdf letter, which had been sent in December apparently (the messages from this organization go straight into spam).

As you can see from the attached letter, the author (Carl Malamud of Public.Resource.Org) claims that the 1999 edition is
now in the public domain because it was Incorporated by Reference by the Department of Education.

Please see: http://www.ecfr.gov/cgi-bin/text-idx?c=ecfr&sid=c67daaa426ddc5f9838ea4247c36c938&rgn=div8&view=text&node=34:3.1.3.1.34.10.39.8&idno=34

We are looking into this now and will report back to the group when we have more information.
Best wishes, John 


John Neikirk 

Director of Publications 

American Educational Research Association 

1430 K Street, NW, Suite 1200 

Washington, DC 20005

From: Felice Levine

Sent: Monday, December 16, 2013 2:38 PM 

To: 'Wayne Camara'; Ernesto, Marianne; 'lwise@humrro.org'; SL@pitt.edu; Frisbie, David A; Gerald Sroufe
Cc: Barbara Plake; John S. Neikirk 
Subject: RE: Existing Standards 

Yes. Wow.... This must be a copyright infringement or something to the equivalent.... But John will look into this. Felice


From: Wayne Camara [mailto:Wayne.Camara@act.org]

Sent: Monday, December 16, 2013 2:16 PM 

To: Ernesto, Marianne; 'lwise@humrro.org'; SL@pitt.edu; Frisbie, David A; Gerald Sroufe; Felice Levine

Cc: Barbara Plake 

Subject: RE: Existing Standards 

Jerry and Felice - can you look into this? We need to get them to pull this off the web. 


From: Ernesto, Marianne [, 

Sent: Monday, December 16, 2013 1:09 PM
To: 'lwise@humrro.org'; SL@pitt.edu; Wayne Camara; Frisbie, David A
Cc: Jerry Sroufe; Barbara Plake 
Subject: RE: Existing Standards 


This is news to me!


Jerry and all, 

Any idea how this got posted? 
Marianne 


Sent: Monday, December 16, 2013 1:53 PM
To: SL@pitt.edu; Wayne Camara; Frisbie, David A
Cc: Jerry Sroufe; Ernesto, Marianne; Barbara Plake 
Subject: Existing Standards 

Are they supposed to be available (in pdf form) free? 

https://law.resource.org/pub/us/cfr/ibr/001/aera.standards.1999.pdf


Lauress (Laurie) L. Wise, Principal Scientist 

Human Resources Research Organization (HumRRO) 

20 Ragsdale Drive, Suite 260 

Monterey, CA 93940 

Phone: 831-647-1004 

Fax 831-375-4021 

Cell: 703-727-3817

UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF COLUMBIA 


AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN PSYCHOLOGICAL ASSOCIATION, INC., and NATIONAL COUNCIL ON MEASUREMENT IN EDUCATION, INC.,

Plaintiffs,

v.

PUBLIC.RESOURCE.ORG, INC.,

Defendant.


Civil Action No. 1:14-cv-00857-TSC-DAR

DECLARATION OF FELICE J. LEVINE IN SUPPORT OF PLAINTIFFS’ MOTION FOR SUMMARY JUDGMENT AND ENTRY OF A PERMANENT INJUNCTION


I, FELICE J. LEVINE, declare: 

1. I am the Executive Director of the American Educational Research Association,
Inc. (“AERA”). I have been employed by the AERA since May 2002. I submit this Declaration
in support of the motion of the AERA, the American Psychological Association, Inc. (“APA”), 
and the National Council on Measurement in Education, Inc. (“NCME”) (collectively, 
“Plaintiffs” or “Sponsoring Organizations”) for summary judgment and the entry of a permanent 
injunction. 

2. As set forth in the AERA Bylaws, the Executive Director is the chief executive 
officer of the Association. In that capacity, I am responsible for all programmatic, financial, 
administrative, staffing, and managerial responsibilities of the AERA. I also advise on and 
implement the policies that guide our organization. 

3. As publisher, the AERA has provided general oversight since November 1999 for 
the production, printing, sales, and marketing of the “Standards for Educational and 
Psychological Testing” (the “Standards”), and for the fiscal management of the revenue and 
expenditure of funds and resources of that publication. AERA was selected to serve as publisher 
by the Management Committee of the three Sponsoring Organizations. As the Executive
Director of the AERA, I have administrative oversight over all of AERA’s implementation of its 
responsibilities regarding the Standards. 

4. AERA is a District of Columbia not-for-profit corporation. 

5. AERA is the major national scientific society for research on education and 
learning. AERA’s mission is to advance knowledge about education, to encourage scholarly 
inquiry related to education, and to promote the use of research to improve education and serve 
the public good. 

6. In 1955, Plaintiffs AERA and NCME prepared and published a companion 
document to APA’s “Technical Recommendations for Psychological Tests and Diagnostic 
Techniques” (published in 1954), entitled, “Technical Recommendations for Achievement 
Tests.” Subsequently, a joint committee of the three organizations modified, revised, and 
consolidated the two documents into the first Joint Standards. Beginning with the 1966 revision, 
the Sponsoring Organizations collaborated in developing the “Joint Standards” (or simply, the 
“Standards”). Each subsequent revision of the Standards has been careful to note that it is a 
revision and update of the prior version. 

7. Beginning in the mid-1950s, the Sponsoring Organizations formed and 
periodically reconstituted a committee of highly trained and experienced experts in 
psychological and educational assessment, charged with the initial development of the Technical 
Recommendations and then each subsequent revision of the (renamed) Standards. These 
committees were formed by the Sponsoring Organizations’ Presidents (or their designees), who 
would meet and jointly agree on the membership. Often a chair or co-chairs of these committees 
were selected by joint agreement. Beginning with the 1966 version of the Standards, this 
committee became referred to as the “Joint Committee.”

8. Financial and operational oversight for the Standards’ revisions, promotion, 
distribution, and for the sale of the 1999 and 2014 Standards has been undertaken by a 
periodically reconstituted Management Committee, comprised of the designees of the three 
Sponsoring Organizations. As Publisher of the 1999 and 2014 Standards, AERA works in 
consultation with the Management Committee to implement its managerial guidance. 

9. All members of the Joint Committee(s) and the Management Committee(s) are 
unpaid volunteers. The expenses associated with the ongoing development and publication of 
the Standards include travel and lodging expenses (for the Joint Committee and Management 
Committee members), support staff time, production, printing and shipment of bound volumes, 
and advertising costs. For the 2014 Standards, the production, printing and shipment of bound 
volumes, and advertising costs, are paid for by the publisher, AERA. 

10. Many different fields of endeavor rely on assessments. The Sponsoring 
Organizations have ensured that the range of these fields of endeavor is represented in the Joint 
Committee’s membership - e.g., admissions, achievement, clinical counseling, educational, 
licensing-credentialing, employment, policy, and program evaluation. Similarly, the Joint
Committee’s members, who are unpaid volunteers, represent expertise across major functional 
assessment areas - e.g., validity, equating, reliability, test development, scoring, reporting, 
interpretation, and large scale interpolation. 

11. From the time of their initial creation to the present, the preparation of and 
periodic revisions to the Standards entail intensive labor and considerable cross-disciplinary 
expertise. Each time the Standards are revised, the Sponsoring Organizations select and arrange 
for extensive meetings of and work by the leading authorities in psychological and educational 


Case 1:14-cv-00857-TSC Document 60-78 Filed 12/21/15 Page 4 of 9
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 278 of 517 

assessments (known as the Joint Committee). During these meetings, certain Standards are 
combined, pared down, and/or augmented, others are deleted altogether, and some are created as 
whole new individual Standards. The 1999 version of the Standards is nearly 200 pages and took
more than five years to complete.

12. The Standards were not created or updated to serve as a legally binding document, 
in response to an expressed governmental or regulatory need, nor in response to any legislative 
action or judicial decision. However, the Standards have been cited in judicial decisions related
to the proper use and evidence for assessment, as well as by state and federal legislators. These 
citations in judicial decisions and during legislative deliberations occurred without any lobbying 
by the Plaintiffs. 

13. AERA has not solicited any government agency to incorporate the Standards into 
the Code of Federal Regulations or other rules of Federal or State agencies. 

14. Plaintiffs promote and sell copies of the Standards via referrals to the AERA 
website, at annual meetings, in public offerings to students, and to educational institution faculty. 
Advertisements promoting the Standards have appeared in meeting brochures, in scholarly 
journals, and in the hallways at professional meetings. Accompanying this Declaration as 
Exhibit NNN is a true copy of advertisements promoting the 1999 Standards, marked as Exhibit 
1218 during my deposition. 

15. All copies of the Standards bear a copyright notice. 

16. Distribution of the Standards is closely monitored by the Sponsoring 
Organizations. AERA, the designated publisher of the Standards, sometimes provides 
promotional complimentary print copies to students or professors. Except for these few
complimentary print copies, however, the Standards are not given away for free; and certainly
they are not made available to the public by any of the three organizations for anyone to copy
free of charge. To date, AERA has never posted, or authorized the posting of, a digitized copy of 
the 1999 Standards on any publicly accessible website. 

17. The 1999 Standards have been sold at retail prices ranging from $25.95 to $49.95 
per copy. From 2000 to 2014, except for the near two-year period during which Public Resource 
posted unauthorized copies online and sales diminished significantly, income generated from 
sales of the 1999 Standards, on average, had been approximately in excess of $127,000 per year. 

18. Accompanying this Declaration as Exhibit OOO is a true copy of AERA’s 
Statement of Revenue and Expenses for the Standards from FY2000 to December 31, 2013, 
marked as Exhibit 1211 during my deposition. 

19. After the 2014 Standards were published in the late summer of 2014, AERA for a 
time discontinued sales of the 1999 Standards. This was to encourage sales of the newly-revised 
edition - the 2014 Standards. Accompanying this Declaration as Exhibit PPP is a true copy of 
the publication page for the 1999 Standards on the AERA website as of May 4, 2015 showing 
that the 1999 Standards were not available for sale at that time, marked as Exhibit 1196 during 
my deposition. 

20. However, so long as purchasers are made aware that it is no longer the current 
edition, the 1999 Standards do have an enduring value for those in the testing and assessment 
profession who (i) need to know the state of best testing practices as they existed between 1999 
and 2014, (ii) believe they still may be held accountable to the guidance of the 1999 Standards 
even now, and/or (iii) study the changes in best testing and assessment practices over time. For 
this reason, in the summer of 2015 AERA resumed sales of the 1999 Standards. Accompanying 
this Declaration as Exhibit QQQ is a true copy of the publication page for the 1999 Standards on 
the AERA website as updated during the summer of 2015, showing that the 1999 Standards are
available for sale. 

21. All revenue from the sale of the 1999 Standards above expenses is used to cover 
the publishing costs of the Standards and for the preparation of subsequent editions of the 
Standards. The Sponsoring Organizations do not distribute any proceeds from the sales of the 
Standards to the Sponsoring Organizations. Rather, the income from these sales is used by the 
Sponsoring Organizations to offset their development and production costs and to generate funds 
for subsequent revisions. This allows the Sponsoring Organizations to develop up-to-date, high 
quality Standards that otherwise would not be developed due to the time and effort that goes into
producing them. 

22. Without receiving revenue from the sales of the Standards to offset their 
preparation costs and to allow for further revisions, it is very likely that the Sponsoring 
Organizations would no longer undertake to periodically update them, and it is unknown who 
else would. 

23. The Sponsoring Organizations decided on a model of self-funding of revisions of 
the Standards; that is, from the sale of prior editions of the Standards. Funding for the Standards 
revision process from third party sources (e.g., governmental agencies, foundations, other
associations interested in testing and assessment issues, etc.) was rejected because of the 
appearance or potential of conflicts of interest and the importance of users of the Standards being 
able to trust in their scientific integrity. 

24. Due to the relative minor portion of the membership of AERA who devote their 
careers to testing and assessment, it is highly unlikely that the members of AERA will vote for a 
dues increase to fund future Standards revision efforts if Public Resource successfully defends 
this case and is allowed to post the Standards online for the public to download or print for free.
As a result, the Sponsoring Organizations would likely abandon their practice of periodically 
updating the Standards. 

25. The Standards were registered with the U.S. Register of Copyrights under 
Registration Number TX 5-100-196, having an effective date of December 8, 1999. 
Accompanying this Declaration as Exhibit RRR is a true copy of the December 8, 1999 
Copyright Certificate of Registration for the 1999 Standards. 

26. A supplementary copyright registration for the Standards was issued by the U.S. 
Register of Copyrights under Supplementary Registration Number TX 6-434-609, having an 
effective date of February 25, 2014. This Supplementary Registration was obtained to correct an 
error in the listing of copyright ownership in Registration Number TX 5-100-196. 
Accompanying this Declaration as Exhibit SSS is a true copy of the February 25, 2014 
Supplementary Copyright Certificate of Registration for the 1999 Standards. 

27. The Joint Committee that authored the 1999 Standards comprised 16 members. 

28. Accompanying this Declaration as Exhibit TTT is a true copy of the 1999 
Standards. 

29. Public Resource posted Plaintiffs’ 1999 Standards to its website and the Internet 
Archive website without the permission or authorization of any of the Sponsoring Organizations. 

30. The Sponsoring Organizations can only speculate on the number of electronic 
copies of the 1999 Standards that were made and distributed to others by the original Internet 
users who accessed the unauthorized copies that Public Resource posted to its site and the 
Internet Archive site. There simply is no way for the Sponsoring Organizations to calculate with 
any degree of certainty the number of university/college professors, students, testing companies 
and others who would have purchased Plaintiffs’ Standards but for their wholesale posting on
Defendant’s https://law.resource.org website and the Internet Archive http://archive.org website. 

31. In December 2013, Plaintiff AERA requested in writing that Public Resource 
remove the 1999 Standards from its online postings. Accompanying this Declaration as Exhibit 
UUU is a true copy of a letter sent from John S. Neikirk, Director of Publications at AERA, to 
Carl Malamud of Public Resource regarding the posting of the 1999 Standards at 
https://law.resource.org/pub/us/cfr/ibr/001/aera.standards.1999.pdf, marked as Exhibit 1228 
during my deposition. 

32. Had Public Resource not promised to remove the 1999 Standards from its 
law.resource.org website and the Internet Archive website while this lawsuit is pending, and 
followed through with these promises, the Sponsoring Organizations seriously contemplated
moving forward with a motion to preliminary enjoin Public Resource from maintaining the 
unauthorized postings of electronic copies of the 1999 Standards on the Internet, and delaying 
publication of the 2014 Standards. 

33. By June 2014, when Public Resource finally removed its online postings of the 
1999 Standards, the damage already had been done. In Fiscal Year (“FY”) 2011 to FY 2012, as 
compared to FY 2011, the Sponsoring Organizations experienced a 34% drop in sales of the 
1999 Standards. In FY 2013, sales of the 1999 Standards remained at their low level from the 
prior fiscal year. 

34. This is notable, given that Public Resource posted the Standards to the Internet in 
2012-2013, and that the Sponsoring Organizations’ updated Standards were not published until 
the summer of 2014. 

35. Past harm from Public Resource’s infringing activities includes lost sales that 
cannot be totally accounted for - due to potentially infinite Internet distribution; for example, by 
psychometrics students - and a lack of funding that otherwise would have been available for the 
update of the Sponsoring Organizations’ Standards from the 1999 to the 2014 versions. 

36. Should Public Resource’s infringement be allowed to continue, the harm to the 
Sponsoring Organizations, and public at large who rely on the preparation and administration of 
valid, fair and reliable tests, includes: (i) uncontrolled publication of the 1999 Standards without 
any notice that those guidelines have been replaced by the 2014 Standards; (ii) future 
unquantifiable loss of revenue from sales of authorized copies of the 1999 Standards (with 
proper notice that they are no longer the current version) and the 2014 Standards; and (iii) lack of 


funding for future revisions of the 2014 Standards. 



Dated: December /^2015 

There have been five earlier documents from 
three sponsoring organizations guiding the 
development and use of tests. The first of these 
was Technical Recommendations for Psychological 
Tests and Diagnostic Techniques, prepared by 
a committee of the American Psychological 
Association (APA) and published by that 
organization in 1954. The second was Technical 
Recommendations for Achievement Tests, prepared 
by a committee representing the American 
Educational Research Association (AERA) 
and the National Council on Measurement 
Used in Education (NCMUE) and published 
by the National Education Association in 
1955. The third, which replaced the earlier 
two, was published by APA in 1966 and 
prepared by a committee representing APA, 
AERA, and the National Council on 
Measurement in Education (NCME) and 
called the Standards for Educational and 
Psychological Tests and Manuals. The fourth, 
Standards for Educational and Psychological 
Tests, was again a collaboration of AERA, APA 
and NCME. and was published in 1974. The 
fifth. Standards for Educational and Psychological 
Testing, also a joint collaboration, was pub¬ 
lished in 1985 

In 1991 APA’s Committee on Psychological 
Tests and Assessment suggested the need 
to revise the 1985 Standards. Representatives 
of AERA, APA and NCME met and discussed 
the revision, principles that should guide 
that revision, and potential Joint Committee 
members. By 1993, the presidents of the 
three organizations appointed members 
and the Committee had its first meeting 
in November 1993. 

The Standards has been developed by a 
joint committee appointed by AERA, APA and 
NCME. Members of the Committee were: 

Eva Baker, co-chair 

Paul Sackett, co-chair 

Lloyd Bond 

Leonard Feldt 


David Goh 
Bert Green 
Edward Haertel 
Jo-Ida Hansen 
Sharon Johnson-Lewis 
Suzanne Lane 
Joseph Matarazzo 
Manfred Meier 
Pamela Moss 
Esteban Olmedo 
Diana Pullin 

From 1993 to 1996 Charles Spielberger 
served on the Committee as co-chair. Each 
sponsoring organization was permitted 
to assign up to two liaisons to the Joint 
Committee’s project. Liaisons served as the 
conduits between the sponsoring organizations 
and the Joint Committee. APA’s liaison 
from its Committee on Psychological Tests 
and Assessments changed several times as the 
membership of the Committee changed. 

Liaisons to the Joint Committee: 

AERA - William Mehrens 
APA - Bruce Bracken, Andrew Czopek, 
Rodney Lowman, Thomas Oakland 
NCME - Daniel Eignor 

APA and NCME also had committees 
who served to monitor the process and keep 
relevant parties informed. 

APA Ad Hoc Committee of the Council of 
Representatives: 

Melba Vasquez 
Donald Bersoff 
Stephen DeMers 
James Farr 
Bertram Karon 
Nadine Lambert 
Charles Spielberger 

NCME Standards and Test Use Committee: 

Gregory Cizek 
Allen Doolittle 
Le Ann Gamache 


JA2603 


Case l:14-cv-00857-TSC Document 60-85 Filed 12/21/15 Page 9 of 100 
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 300 of 517 

PREFACE 


Donald Ross Green 
Ellen Julian 
Tracy Muenz 
Nambury Raju 

A management committee was formed at 
the beginning of this effort. They monitored 
the financial and administrative arrangements 
of the project, and advised the sponsoring 
organizations on such matters. 

Management Committee: 

Frank Farley, APA 
George Madaus, AERA 
Wendy Yen, NCME 

Staffing for the revision included Dianne 
Brown Maranto as project director, and 
Dianne L. Schneider as staff liaison. Wayne J. 
Camara served as project director from 1993 to 
1994. APA’s legal counsel conducted the legal 
review of the Standards. William C. Howell 
and William Mehrens reviewed the standards 
for consistency across chapters. Linda Murphy 
developed the indexing for the book. 

The Joint Committee solicited preliminary 
reviews of some draft chapters from recognized 
experts. These reviews were primarily 
solicited for the technical and fairness chapters. 
Reviewers are listed below: 

Marvin Alkin 
Philip Bashook 
Bruce Bloxom 
Jeffery P. Braden 
Robert L. Brennan 
John Callender 
Ronald Cannella 
Lee J. Cronbach 
James Cummins 
John Fremer 
Kurt F. Geisinger 
Robert M. Guion 
Walter Haney 
Patti L. Harrison 
Gerald P. Koocher 
Richard Jeanneret 


Frank Landy 
Ellen Lent 
Robert Linn 
Theresa C. Liu 
Stanford von Mayrhauser 
Milbrey W. McLaughlin 
Samuel Messick 
Craig N. Mills 
Robert J. Mislevy 
Kevin R. Murphy 
Mary Anne Nester 
Maria Pennock-Roman 
Carole Perlman 
Michael Rosenfeld 
Jonathan Sandoval 
Cynthia B. Schmeiser 
Kara Schmitt 
Neal Schmitt 
Richard J. Shavelson 
Lorrie A. Shepard 
Mark E. Swerdlik 
Janet Wall 
Anthony R. Zara 

Draft versions of the Standards were 
widely distributed for public review and 
comment three times during this revision 
effort, providing the Committee with a 
total of nearly 8,000 pages of comments. 
Organizations who submitted comments on 
drafts are listed below. Many individuals 
contributed to the input from each organi¬ 
zation, and although we wish we could 
acknowledge every individual who had input, 
we cannot do so due to incomplete informa¬ 
tion as to who contributed to each organiza¬ 
tion's response. The Joint Committee could 
not have completed its task without the 
thoughtful reviews of so many professionals. 

Sponsoring Associations 

American Educational Research 
Association (AERA) 

American Psychological Association (APA) 
National Council on Measurement in 
Education (NCME) 
Membership Organizations (Scientific, 
Professional, Trade & Advocacy) 

American Association for Higher 
Education (AAHE) 

American Board of Medical Specialties 
(ABMS) 

American Counseling Association (ACA) 
American Evaluation Association (AEA) 
American Occupational Therapy 
Association 

American Psychological Society (APS) 
APA Division of Counseling Psychology 
(Division 17) 

APA Division of Developmental 
Psychology (Division 7) 

APA Division of Evaluation, Measurement, 
and Statistics (Division 5) 

APA Division of Mental Retardation & 
Developmental Disabilities (Division 33) 
APA Division of Pharmacology & 
Substance Abuse (Division 28) 

APA Division of Rehabilitation 
Psychology (Division 22) 

APA Division of School Psychology 
(Division 16) 

Asian American Psychological 
Association (AAPA) 

Association for Assessment in 
Counseling (AAC) 

Association of Test Publishers (ATP) 
Australian Council for Educational 
Research Limited (ACER) 

Chicago Industrial/Organizational 
Psychologists (CIOP) 

Council on Licensure, Enforcement, and 
Regulation (CLEAR), Examination 
Resources & Advisory Committee 
(ERAC) 

Equal Employment Advisory Council 
(EEAC) 

Foundation for Rehabilitation 
Certification, Education and Research 

Human Sciences Research Council, 
South Africa 

International Association for Cross- 
Cultural Psychology (IACCP) 


International Brotherhood of Electrical 
Workers 

International Language Testing Association 
International Personnel Management 
Association Assessment Council 
(IPMAAC) 

Joint Committee on Testing Practices 
(JCTP) 

National Association for the Advancement 
of Colored People (NAACP), Legal 
Defense and Educational Fund, Inc. 
National Center for Fair and Open 
Testing (FairTest) 

National Organization for Competency 
Assurance (NOCA) 

Personnel Testing Council of Metropolitan 
Washington (PTC/MW) 

Personnel Testing Council of Southern 
California (PTC/SC) 

Society for Human Resource Management 
(SHRM) 

Society of Indian Psychologists (SIP) 
Society for Industrial and Organizational 
Psychology (APA Division 14) 

Society for the Psychological Study 
of Ethnic Minority Issues (APA 
Division 45) 

State Collaborative on Assessment & 
Student Standards Technical Guidelines 
for Performance Assessment 
Consortium (TGPA) 
Telecommunications Staffing Forum 
Western Region Intergovernmental 
Personnel Assessment Council 
(WRIPAC) 

Credentialing Boards 

American Board of Physical and Medical 
Rehabilitation 

American Medical Technologists 
Commission on Rehabilitation 
Counselor Certification 
National Board for Certified Counselors 
(NBCC) 

National Board of Examiners in 
Optometry 
National Board of Medical Examiners 
National Council of State Boards of 
Nursing 

Government and Federal Agencies 

Army Research Institute (ARI) 

California Highway Patrol, Personnel and 
Training Division, Selection Research 
Program 

City of Dallas, Civil Service Department 
Commonwealth of Virginia, Department 
of Education 

Defense Manpower Data Center 

(DMDC), Personnel Testing Division 
Department of Defense (DOD), Office 
of the Assistant Secretary of Defense 
Department of Education, Office of 
Educational Improvement, National 
Center for Education Statistics 
Department of Justice, Immigration and 
Naturalization Service (INS) 
Department of Labor, Employment and 
Training Administration (DOL/ETA) 
U.S. Equal Employment Opportunity 
Commission (EEOC) 

U.S. Office of Personnel Management 
(OPM), Personnel Resources & 
Development Center 

Test Publishers/Developers 

American College Testing (ACT) 
CTB/McGraw-Hill 
The College Board 
Educational Testing Service (ETS) 
Highland Publishing Company 
Institute for Personality & Ability 
Testing (IPAT) 

Professional Examination Service (PES) 

Academic Institutions 

Center for Creative Leadership 
Gallaudet University, National Task 
Force on Equity in Testing Deaf 
Professionals 

University of Haifa, Israeli Group 
Kansas State University 
National Center on Educational 
Outcomes (NCEO) 


Pennsylvania State University 
University of North Carolina - Charlotte 
University of Southern Mississippi, 
Department of Psychology 

When the Joint Committee completed 
its task of revising the Standards, it then 
submitted its work to the three sponsoring 
organizations for approval. Each organization 
had its own governing body and mechanism 
for approval, as well as definitions for what 
their approval means. 

AERA: This endorsement carries with it 
the understanding that, in general, we 
believe the Standards to represent the 
current consensus among recognized 
professionals regarding expected meas¬ 
urement practice. Developers, sponsors, 
publishers, and users of tests should 
observe these Standards. 

APA: The APA’s approval of the 
Standards means the Council adopts 
the document as APA policy. 

NCME: NCME endorses the Standards 
for Educational and Psychological Testing 
and recognizes that the intent of these 
Standards is to promote sound and 
responsible measurement practice. This 
endorsement carries with it a profes¬ 
sional imperative for NCME members 
to attend to the Standards. 

Although the Standards are prescriptive, the 
Standards itself does not contain enforcement 
mechanisms. These standards were formulated 
with the intent of being consistent with other 
standards, guidelines and codes of conduct 
published by the three sponsoring organizations, 
and listed below. The reader is encouraged to 
obtain these documents, some of which have 
references to testing and assessment in specific 
applications or settings. 

The Joint Committee on the 
Standards for Educational and 
Psychological Testing 
INTRODUCTION 


Educational and psychological testing and 
assessment are among the most important 
contributions of behavioral science to our 
society, providing fundamental and signifi¬ 
cant improvements over previous practices. 
Although not all tests are well-developed nor 
are all testing practices wise and beneficial, 
there is extensive evidence documenting the 
effectiveness of well-constructed tests for uses 
supported by validity evidence. The proper 
use of tests can result in wiser decisions about 
individuals and programs than would be the 
case without their use and also can provide a 
route to broader and more equitable access to 
education and employment. The improper 
use of tests, however, can cause considerable 
harm to test takers and other parties affected 
by test-based decisions. The intent of the 
Standards is to promote the sound and ethical 
use of tests and to provide a basis for evaluat¬ 
ing the quality of testing practices. 

Participants in the Testing Process 

Educational and psychological testing and 
assessment involve and significantly affect 
individuals, institutions, and society as a 
whole. The individuals affected include stu¬ 
dents, parents, teachers, educational adminis¬ 
trators, job applicants, employees, clients, 
patients, supervisors, executives, and evaluators, 
among others. The institutions affected 
include schools, colleges, businesses, industry, 
clinics, and government agencies. Individuals 
and institutions benefit when testing helps them 
achieve their goals. Society, in turn, benefits 
when testing contributes to the achievement 
of individual and institutional goals. 

The interests of the various parties 
involved in the testing process are usually, 
but not always, congruent. For example, 
when a test is given for counseling purposes 
or for job placement, the interests of the 
individual and the institution often coin¬ 
cide. In contrast, when a test is used to 


select from among many individuals for a 
highly competitive job or for entry into an 
educational or training program, the prefer¬ 
ences of an applicant may be inconsistent 
with those of an employer or admissions 
officer. Similarly, when testing is mandated 
by a court, the interests of the test taker may 
be different from those of the party requesting 
the court order. 

There are many participants in the testing 
process, including, among others: (a) those who 
prepare and develop the test; (b) those who 
publish and market the test; (c) those who 
administer and score the test; (d) those who 
use the test results for some decision-making 
purpose; (e) those who interpret test results for 
clients; (f) those who take the test by choice, 
direction, or necessity; (g) those who sponsor 
tests, which may be boards that represent 
institutions or governmental agencies that 
contract with a test developer for a specific 
instrument or service; and (h) those who select 
or review tests, evaluating their comparative 
merits or suitability for the uses proposed. 

These roles are sometimes combined and 
sometimes further divided. For example, in 
clinics the test taker is typically the intended 
beneficiary of the test results. In some situa¬ 
tions the test administrator is an agent of the 
test developer, and sometimes the test admin¬ 
istrator is also the test user. When an industrial 
organization prepares its own employment 
tests, it is both the developer and the user. 
Sometimes a test is developed by a test author 
but published, advertised, and distributed by 
an independent publisher, though the publisher 
may play an active role in the test development. 
Given this intermingling of roles, it is difficult 
to assign precise responsibility for addressing 
various standards to specific participants in 
the testing process. 

This document begins with a series of 
chapters on the test development process, 
which focus primarily on the responsibilities 
of test developers, and then turns to chapters 
on specific uses and applications, which focus 
primarily on responsibilities of test users. One 
chapter is devoted specifically to the rights 
and responsibilities of test takers. 

The Standards is based on the premise 
that effective testing and assessment require 
that all participants in the testing process possess 
the knowledge, skills, and abilities relevant 
to their role in the testing process, as 
well as awareness of personal and contextual 
factors that may influence the testing process. 
They also should obtain any appropriate 
supervised experience and legislatively man¬ 
dated practice credentials necessary to perform 
competently those aspects of the testing 
process in which they engage. For example, 
test developers and those selecting and 
interpreting tests need adequate knowledge 
of psychometric principles such as validity 
and reliability. 

The Purpose of the Standards 

The purpose of publishing the Standards is 
to provide criteria for the evaluation of tests, 
testing practices, and the effects of test use. 
Although the evaluation of the appropriate¬ 
ness of a test or testing application should 
depend heavily on professional judgment, the 
Standards provides a frame of reference to 
assure that relevant issues are addressed. It is 
hoped that all professional test developers, 
sponsors, publishers, and users will adopt the 
Standards and encourage others to do so. 

The Standards makes no attempt to provide 
psychometric answers to questions of 
public policy regarding the use of tests. In 
general, the Standards advocates that, within 
feasible limits, the relevant technical informa¬ 
tion be made available so that those involved 
in policy debate may be fully informed. 

Categories of Standards 

The 1985 Standards designated each standard 
as “primary” (to be met by all tests before 
operational use), “secondary” (desirable, but 
not feasible in certain situations), or “conditional” 
(importance varies with application). 
The present Standards continues the tradition 
of expecting test developers and users to con¬ 
sider all standards before operational use; 
however, the Standards does not continue the 
practice of designating levels of importance. 
Instead, the text of each standard, and any 
accompanying commentary, discusses the 
conditions under which a standard is relevant. 
It was not the case that under the 1985 
Standards test developers and users were obli¬ 
gated to attend only to the primary standards. 
Rather, the term “conditional” meant that a 
standard was primary in some settings and 
secondary in others, thus requiring careful 
consideration of the applicability of each standard 
for a given setting. 

The absence of designations such as 
“primary” or “conditional” should not be 
taken to imply that all standards are equally 
significant in any given situation. Depending 
on the context and purpose of test development 
or use, some standards will be more 
salient than others. Moreover, some standards 
are broad in scope, setting forth concerns or 
requirements relevant to nearly all tests or 
testing contexts, and other standards are nar¬ 
rower in scope. However, all standards are 
important in the contexts to which they 
apply. Any classification that gives the appear¬ 
ance of elevating the general importance of 
some standards over others could invite neglect 
of some standards that need to be addressed 
in particular situations. 

Further, the current Standards does not 
include standards considered secondary or 
“desirable.” The continued use of the secondary 
designation would risk encouraging both 
the expansion of the Standards to encompass 
large numbers of “desirable” standards and 
the inappropriate assumption that any guideline 
not included in the Standards as at least 
“secondary” was inconsequential. 

Unless otherwise specified in the standard 
or commentary, and with the caveats 
outlined below, standards should be met 
before operational test use. This means that 
each standard should be carefully considered 
to determine its applicability to the testing 
context under consideration. In a given case 
there may be a sound professional reason why 
adherence to the standard is unnecessary. It is 
also possible that there may be occasions 
when technical feasibility may influence 
whether a standard can be met prior to 
operational test use. For example, some 
standards may call for analyses of data that 
may not be available at the point of initial 
operational test use. If test developers, users, 
and, when applicable, sponsors have deemed 
a standard to be inapplicable or unfeasible, 
they should be able, if called upon, to explain 
the basis for their decision. However, there 
is no expectation that documentation be 
routinely available of the decisions related 
to each standard. 

Tests and Test Uses to 
Which These Standards Apply 

A test is an evaluative device or procedure in 
which a sample of an examinee's behavior in a 
specified domain is obtained and subsequent¬ 
ly evaluated and scored using a standardized 
process. While the label test is ordinarily 
reserved for instruments on which responses 
are evaluated for their correctness or quality 
and the terms scale or inventory are used for 
measures of attitudes, interest, and dispositions, 
the Standards uses the single term test 
to refer to all such evaluative devices. 

A distinction is sometimes made between 
test and assessment. Assessment is a broader 
term, commonly referring to a process that 
integrates test information with information 
from other sources (e.g., information from 
the individual's social, educational, employment, 
or psychological history). The applicability 
of the Standards to an evaluation device 
or method is not altered by the label applied 
to it (e.g., test, assessment, scale, inventory). 


Tests differ on a number of dimensions: 
the mode in which test materials are present¬ 
ed (paper and pencil, oral, computerized 
administration, and so on); the degree to 
which stimulus materials are standardized; 
the type of response format (selection of a 
response from a set of alternatives as opposed 
to the production of a response); and the 
degree to which test materials are designed to 
reflect or simulate a particular context. In all 
cases, however, tests standardize the process 
by which test-taker responses to test materials 
are evaluated and scored. As noted in prior 
versions of the Standards, the same general 
types of information are needed for all vari¬ 
eties of tests. 

The precise demarcation between those 
measurement devices used in the fields of 
educational and psychological testing that do 
and do not fall within the purview of the 
Standards is difficult to identify. Although the 
Standards applies most directly to standardized 
measures generally recognized as “tests,” 
such as measures of ability, aptitude, achieve¬ 
ment, attitudes, interests, personality, cogni¬ 
tive functioning, and mental health, it may 
also be usefully applied in varying degrees to 
a broad range of less formal assessment tech¬ 
niques. Admittedly, it will generally not be 
possible to apply the Standards rigorously to 
unstandardized questionnaires or to the broad 
range of unstructured behavior samples used 
in some forms of clinic- and school-based 
psychological assessment (e.g., an intake interview), 
and to instructor-made tests that are 
used to evaluate student performance in education 
and training. It is useful to distinguish 
between devices that lay claim to the concepts 
and techniques of the field of educational and 
psychological testing from those which repre¬ 
sent nonstandardized or less standardized aids 
to day-to-day evaluative decisions. Although 
the principles and concepts underlying the 
Standards can be fruitfully applied to day-to- 
day decisions, such as when a business owner 
interviews a job applicant, a manager evaluates 
the performance of subordinates, or a 
coach evaluates a prospective athlete, it would 
be overreaching to expect that the standards 
of the educational and psychological testing 
field be followed by those making such deci¬ 
sions. In contrast, a structured interviewing 
system developed by a psychologist and 
accompanied by claims that the system has 
been found to be predictive of job perform¬ 
ance in a variety of other settings falls within 
the purview of the Standards. 

Cautions to be Exercised in Using 
the Standards 

Several cautions are important to avoid 
misinterpreting the Standards: 

1) Evaluating the acceptability of a test 
or test application does not rest on the literal 
satisfaction of every standard in this docu¬ 
ment, and acceptability cannot be determined 
by using a checklist. Specific circumstances 
affect the importance of individual standards, 
and individual standards should not be con¬ 
sidered in isolation. Therefore, evaluating 
acceptability involves (a) professional judgment 
that is based on a knowledge of behavioral sci¬ 
ence, psychometrics, and the community 
standards in the professional field to which 
the tests apply; (b) the degree to which the 
intent of the standard has been satisfied by 
the test developer and user; (c) the alternatives 
that are readily available; and (d) research and 
experiential evidence regarding feasibility of 
meeting the standard. 

2) When tests are at issue in legal proceedings 
and other venues requiring expert 
witness testimony, it is essential that professional 
judgment be based on the accepted 
corpus of knowledge in determining the rele¬ 
vance of particular standards in a given situa¬ 
tion. The intent of the Standards is to offer 
guidance for such judgments. 

3) Claims by test developers or test users 
that a test, manual, or procedure satisfies or 
follows these standards should be made with 


care. It is appropriate for developers or users 
to state that efforts were made to adhere to 
the Standards, and to provide documents 
describing and supporting those efforts. 
Blanket claims without supporting evidence 
should not be made. 

4) These standards are concerned with a 
field that is evolving. Consequently, there is 
a continuing need to monitor changes in the 
field and to revise this document as knowl¬ 
edge develops. 

5) Prescription of the use of specific 
technical methods is not the intent of the 
Standards. For example, where specific statis¬ 
tical reporting requirements are mentioned, 
the phrase "or generally accepted equivalent" 
always should be understood. 

The standards do not attempt to repeat 
or to incorporate the many legal or regulatory 
requirements that might be relevant to the 
issues they address. In some areas, such as the 
collection, analysis, and use of test data and 
results for different subgroups, the law may 
both require participants in the testing process 
to take certain actions and prohibit those 
participants from taking other actions. Where 
it is apparent that one or more standards or 
comments address an issue on which estab¬ 
lished legal requirements may be particularly 
relevant, the standard, comment, or introduc¬ 
tory material may make note of that fact. 
Lack of specific reference to legal require¬ 
ments, however, does not imply that no rele¬ 
vant requirement exists. In all situations, 
participants in the testing process should 
separately consider and, where appropriate, 
obtain legal advice on legal and regulatory 
requirements. 

The Number of Standards 

The number of standards has increased from 
the 1985 Standards for a variety of reasons. 
First, and most importantly, new developments 
have led to the addition of new standards. 
Commonly these deal with new types 
of tests or new uses for existing tests, rather 
than being broad standards applicable to all 
tests. Second, on the basis of recognition that 
some users of the Standards may turn only to 
chapters directly relevant to a given applica¬ 
tion, certain standards are repeated in differ¬ 
ent chapters. When such repetition occurs, 
the essence of the standard is the same. Only 
the wording, area of application, or elabora¬ 
tion in the comment is changed. Third, 
standards dealing with important nontechnical 
issues, such as avoiding conflicts of inter¬ 
est and equitable treatment of all test takers, 
have been added. Although such topics have 
not been addressed in prior versions of the 
Standards, they are not likely to be viewed as 
imposing burdensome new requirements. 
Thus the increase in the number of standards 
does not per se signal an increase in 
the obligations placed on test developers 
and test users. 

Tests as Measures of Constructs 

We depart from some historical uses of the 
term “construct,” which reserve the term for 
characteristics that are not directly observable, 
but which are inferred from interrelated sets 
of observations. This historical perspective 
invites confusion. Some tests are viewed as 
measures of constructs, while others are not. 
In addition, considerable debate has ensued 
as to whether certain characteristics measured 
by tests are properly viewed as constructs. 
Furthermore, the types of validity evidence 
thought to be suitable can differ as a result 
of whether a given test is viewed as measur¬ 
ing a construct. 

We use the term construct more broadly 
as the concept or characteristic that a test is 
designed to measure. Rarely, if ever, is there a 
single possible meaning that can be attached 
to a test score or a pattern of test responses. 
Thus, it is always incumbent on a testing 
professional to specify the construct interpre¬ 
tation that will be made on the basis of the 


score or response pattern. The notion that 
some tests are not under the purview of the 
Standards because they do not measure constructs 
is contrary to this use of the term. 
Also, as detailed in chapter 1, evolving conceptualizations 
of the concept of validity no 
longer speak of different types of validity but 
speak instead of different lines of validity evi¬ 
dence, all in service of providing information 
relevant to a specific intended interpretation 
of test scores. Thus, many lines of evidence 
can contribute to an understanding of the 
construct meaning of test scores. 

Organization of This Volume 

Part I of the Standards, “Test Construction, 
Evaluation, and Documentation,” contains 
standards for validity (ch. 1); reliability and 
errors of measurement (ch. 2); test develop¬ 
ment and revision (ch. 3); scaling, norming, 
and score comparability (ch. 4); test adminis¬ 
tration, scoring, and reporting (ch. 5); and 
supporting documentation for tests (ch. 6). 
Part II addresses “Fairness in Testing,” and
contains standards on fairness and bias (ch. 7);
the rights and responsibilities of test takers
(ch. 8); testing individuals of diverse linguistic
backgrounds (ch. 9); and testing individuals
with disabilities (ch. 10). Part III treats
specific “Testing Applications,” and contains
standards involving general responsibilities of 
test users (ch. 11); psychological testing and 
assessment (ch. 12); educational testing and 
assessment (ch. 13); testing in employment 
and credentialing (ch. 14); and testing in program
evaluation and public policy (ch. 15).

Each chapter begins with introductory 
text that provides background for the stan¬ 
dards that follow. This revision of the 
Standards contains more extensive intro¬ 
ductory text material than its predecessor. 
Recognizing the common use of the Standards 
in the education of future test developers 
and users, the committee opted to provide a 
context for the standards themselves by presenting
more background material than in
previous versions. This text is designed to 
assist in the interpretation of the standards 
that follow in each chapter. Although the text 
is at times prescriptive and exhortatory, it
should not be interpreted as imposing addi¬ 
tional standards. 

The Standards also contains an index and 
includes a glossary that provides definitions 
for terms as they are specifically used in this 
volume. 
Background

Validity refers to the degree to which evidence
and theory support the interpretations of test 
scores entailed by proposed uses of tests. 
Validity is, therefore, the most fundamental 
consideration in developing and evaluating 
tests. The process of validation involves accu¬ 
mulating evidence to provide a sound scientific 
basis for the proposed score interpretations. 
It is the interpretations of test scores required 
by proposed uses that are evaluated, not the 
test itself. When test scores are used or interpreted
in more than one way, each intended
interpretation must be validated. 

Validation logically begins with an explicit 
statement of the proposed interpretation of 
test scores, along with a rationale for the rele¬ 
vance of the interpretation to the proposed 
use. The proposed interpretation refers to the 
construct or concepts the test is intended to 
measure. Examples of constructs are mathe¬ 
matics achievement, performance as a com¬ 
puter technician, depression, and self-esteem. 
To support test development, the proposed 
interpretation is elaborated by describing 
its scope and extent and by delineating the 
aspects of the construct that are to be repre¬ 
sented. The detailed description provides a 
conceptual framework for the test, delineat¬ 
ing the knowledge, skills, abilities, processes, 
or characteristics to be assessed. The framework
indicates how this representation of
the construct is to be distinguished from
other constructs and how it should relate
to other variables.

The conceptual framework is partially 
shaped by the ways in which test scores will 
be used. For instance, a test of mathematics 
achievement might be used to place a student
in an appropriate program of instruction, to 
endorse a high school diploma, or to inform 
a college admissions decision. Each of these 
uses implies a somewhat different interpretation
of the mathematics achievement test
scores: that a student will benefit from a
particular instructional intervention, that a 
student has mastered a specified curriculum, 
or that a student is likely to be successful 
with college-level work. Similarly, a test of 
self-esteem might be used for psychological 
counseling, to inform a decision about 
employment, or for the basic scientific purpose
of elaborating the construct of self-esteem.
Each of these potential uses shapes the specified
framework and the proposed interpretation of
the test scores and also has implications for
test development and evaluation.

Validation can be viewed as developing a 
scientifically sound validity argument to sup¬ 
port the intended interpretation of test scores 
and their relevance to the proposed use. The 
conceptual framework points to the kinds of 
evidence that might be collected to evaluate 
the proposed interpretation in light of the
purposes of testing. As validation proceeds,
and new evidence about the meaning of a
test’s scores becomes available, revisions may
be needed in the test, in the conceptual
framework that shapes it, and even in the
construct underlying the test. 

The wide variety of tests and circum¬ 
stances makes it natural that some types of 
evidence will be especially critical in a given 
case, whereas other types will be less useful. 
The decision about what types of evidence 
are important for validation in each instance
can be clarified by developing a set of propositions
that support the proposed interpretation
for the particular purpose of testing. For 
instance, when a mathematics achievement 
test is used to assess readiness for an advanced 
course, evidence for the following proposi¬ 
tions might be deemed necessary: (a) that cer¬ 
tain skills are prerequisite for the advanced 
course; (b) that the content domain of the 
test is consistent with these prerequisite skills;
(c) that test scores can be generalized across
relevant sets of items; (d) that test scores are
not unduly influenced by ancillary variables,
such as writing ability; (e) that success in the
advanced course can be validly assessed; and 
(f) that examinees with high scores on the 
test will be more successful in the advanced 
course than examinees with low scores on the 
test. Examples of propositions in other testing
contexts might include, for instance, the
proposition that examinees with high general
anxiety scores experience significant anxiety 
in a range of settings, the proposition that a 
child's score on an intelligence scale is strong¬ 
ly related to the child's academic performance, 
or the proposition that a certain pattern of 
scores on a neuropsychological battery indi¬ 
cates impairment characteristic of brain injury. 
The validation process evolves as these propo¬ 
sitions are articulated and evidence is gathered 
to evaluate their soundness. 

Identifying the propositions implied by 
a proposed test interpretation can be facili¬ 
tated by considering rival hypotheses that 
may challenge the proposed interpretation. 
It is also useful to consider the perspectives 
of different interested parties, existing expe¬ 
rience with similar tests and contexts, and 
the expected consequences of the proposed 
test use. Plausible rival hypotheses can often 
be generated by considering whether a test 
measures less or more than its proposed
construct. Such concerns are referred to as 
construct underrepresentation and construct- 
irrelevant variance. 

Construct underrepresentation refers to 
the degree to which a test fails to capture 
important aspects of the construct. It implies 
a narrowed meaning of test scores because 
the test does not adequately sample some 
types of content, engage some psychological 
processes, or elicit some ways of responding 
that are encompassed by the intended con¬ 
struct. Take, for example, a test of reading 
comprehension intended to measure chil¬ 
dren's ability to read and interpret stories 
with understanding. A particular test might 
underrepresent the intended construct because 
it did not contain a sufficient variety of reading
passages or ignored a common type of
reading material. As another example, a test 
of anxiety might measure only physiological 
reactions and not emotional, cognitive, or 
situational components. 

Construct-irrelevant variance refers to 
the degree to which test scores are affected by
processes that are extraneous to its intended
construct. The test scores may be systematically
influenced to some extent by components
that are not part of the construct. In
the case of a reading comprehension test, 
construct-irrelevant components might 
include an emotional reaction to the test 
content, familiarity with the subject matter 
of the reading passages on the test, or the 
writing skill needed to compose a response. 
Depending on the detailed definition of the 
construct, vocabulary knowledge or reading 
speed might also be irrelevant components. 
On a test of anxiety, a response bias to under¬ 
report anxiety might be considered a source 
of construct-irrelevant variance. 

Nearly all tests leave out elements that 
some potential users believe should be meas¬ 
ured and include some elements that some 
potential users consider inappropriate. 
Validation involves careful attention to possible 
distortions in meaning arising from inadequate 
representation of the construct and also to 
aspects of measurement such as test format, 
administration conditions, or language level 
that may materially limit or qualify the interpretation
of test scores. That is, the process
of validation may lead to revisions in the test, 
the conceptual framework of the test, or both. 
The revised test would then need validation. 

When propositions have been identified 
that would support the proposed interpretation
of test scores, validation can proceed by devel¬ 
oping empirical evidence, examining relevant 
literature, and/or conducting logical analyses to 
evaluate each of these propositions. Empirical 
evidence may include both local evidence, 
produced within the contexts where the test 
will be used, and evidence from similar testing 
applications in other settings. Use of existing
evidence from similar tests and contexts can 
enhance the quality of the validity argument, 
especially when current data are limited.

Because a validity argument typically 
depends on more than one proposition, strong
evidence in support of one in no way dimin¬ 
ishes the need for evidence to support others. 
For example, a strong predictor-criterion rela¬ 
tionship in an employment setting is not suf¬ 
ficient to justify test use for selection without 
considering the appropriateness and meaning- 
fulness of the criterion measure. Professional 
judgment guides decisions regarding the spe¬ 
cific forms of evidence that can best support 
the intended interpretation and use. As in 
all scientific endeavors, the quality of the 
evidence is primary. A few lines of solid evi¬ 
dence regarding a particular proposition are 
better than numerous lines of evidence of 
questionable quality. 

Validation is the joint responsibility of 
the test developer and the test user. The test 
developer is responsible for furnishing rele¬ 
vant evidence and a rationale in support of 
the intended test use. The test user is ultimately
responsible for evaluating the evidence in the 
particular setting in which the test is to be 
used. When the use of a test differs from that 
supported by the test developer, the test user 
bears special responsibility for validation. The 
standards apply to the validation process, for 
which the appropriate parties share responsibility.
It should be noted that important contributions
to the validity evidence are made as
other researchers report findings of investiga¬ 
tions that are related to the meaning of scores 
on the test.

Sources of Validity Evidence 

The following sections outline various sources 
of evidence that might be used in evaluating a 
proposed interpretation of test scores for par¬ 
ticular purposes. These sources of evidence 
may illuminate different aspects of validity,
but they do not represent distinct types of
validity. Validity is a unitary concept. It is the 
degree to which all the accumulated evidence 
supports the intended interpretation of test 
scores for the proposed purpose. Like the 
1985 Standards, this edition refers to types of 
validity evidence, rather than distinct types of 
validity. To emphasize this distinction, the 
treatment that follows does not follow traditional
nomenclature (i.e., the use of the terms
content validity or predictive validity). The 
glossary contains definitions of the traditional 
terms, explicating the difference between tra¬ 
ditional and current use. 

Evidence Based on Test Content 

Important validity evidence can be obtained 
from an analysis of the relationship between a 
test’s content and the construct it is intended 
to measure. Test content refers to the themes, 
wording, and format of the items, tasks, or 
questions on a test, as well as the guidelines for 
procedures regarding administration and scor¬ 
ing. Test developers often work from a specifi¬ 
cation of the content domain. The content 
specification carefully describes the content in 
detail, often with a classification of areas of 
content and types of items. Evidence based on 
test content can include logical or empirical 
analyses of the adequacy with which the test 
content represents the content domain and of 
the relevance of the content domain to the 
proposed interpretation of test scores. Evidence 
based on content can also come from expert 
judgments of the relationship between parts
of the test and the construct. For example, in 
developing a licensure test, the major facets of 
the specific occupation can be specified, and 
experts in that occupation can be asked to
assign test items to the categories defined by 
those facets. They, or other qualified experts, 
can then judge the representativeness of the 
chosen set of items. Sometimes rules or algo¬ 
rithms can be constructed to select or generate 
items that differ systematically on the various
facets of content, according to specifications. 
Some tests are based on systematic observations
of behavior. For example, a listing of
the tasks comprising a job domain may be 
developed from observations of behavior in a 
job, together with judgments of subject-matter
experts. Expert judgments can be used to assess 
the relative importance, criticality, and/or fre¬ 
quency of the various tasks. A job sample test 
can then be constructed from a random or 
stratified sampling of tasks rated highly on 
these characteristics. The test can then be 
administered under standardized conditions 
in an off-the-job setting.

The appropriateness of a given content 
domain is related to the specific inferences to 
be made from test scores. Thus, when consid¬ 
ering an available test for a purpose other than 
that for which it was first developed, it is 
especially important to evaluate the appropri¬ 
ateness of the original content domain for the 
proposed new use. In educational program 
evaluations, for example, tests may properly 
cover material that receives little or no atten¬ 
tion in the curriculum, as well as that toward 
which instruction is directed. Policymakers 
can then evaluate student achievement with 
respect to both content neglected and content 
addressed. On the other hand, when student 
mastery of a delivered curriculum is tested for 
purposes of informing decisions about indi¬ 
vidual students, such as promotion or gradua¬ 
tion, the framework elaborating a content 
domain is appropriately limited to what stu¬ 
dents have had an opportunity to learn from 
the curriculum as delivered. 

Evidence about content can be used, in 
part, to address questions about differences in 
the meaning or interpretation of test scores 
across relevant subgroups of examinees. Of 
particular concern is the extent to which con¬ 
struct underrepresentation or construct-irrele¬ 
vant components may give an unfair advantage 
or disadvantage to one or more subgroups of 
examinees. Careful review of the construct 
and test content domain by a diverse panel 
of experts may point to potential sources of
irrelevant difficulty (or easiness) that require
further investigation. 

Evidence Based on Response Processes 

Theoretical and empirical analyses of the 
response processes of test takers can provide 
evidence concerning the fit between the con¬ 
struct and the detailed nature of performance 
or response actually engaged in by examinees. 
For instance, if a test is intended to assess
mathematical reasoning, it becomes impor¬ 
tant to determine whether examinees are, in 
fact, reasoning about the material given instead 
of following a standard algorithm. For another 
instance, scores on a scale intended to assess 
the degree of an individual’s extroversion or 
introversion should not be strongly influenced 
by social conformity. 

Evidence based on response processes 
generally comes from analyses of individual 
responses. Questioning test takers about their 
performance strategies or responses to partic¬ 
ular items can yield evidence that enriches the 
definition of a construct. Maintaining records 
that monitor the development of a response 
to a writing task, through successive written 
drafts or electronically monitored revisions, 
for instance, also provides evidence of process. 
Documentation of other aspects of performance, 
like eye movements or response times, may 
also be relevant to some constructs. Inferences 
about processes involved in performance can 
also be developed by analyzing the relationship 
among parts of the test and between the test 
and other variables. Wide individual differ¬ 
ences in process can be revealing and may lead 
to reconsideration of certain test formats. 

Evidence of response processes can 
contribute to questions about differences in 
meaning or interpretation of test scores across 
relevant subgroups of examinees. Process stud¬ 
ies involving examinees from different sub¬ 
groups can assist in determining the extent to 
which capabilities irrelevant or ancillary to the 
construct may be differentially influencing 
their performance. 
Studies of response processes are not limited
to the examinee. Assessments often rely
on observers or judges to record and/or evalu¬ 
ate examinees’ performances or products. In 
such cases, relevant validity evidence includes 
the extent to which the processes of observers
or judges are consistent with the intended 
interpretation of scores. For instance, if 
judges are expected to apply particular criteria 
in scoring examinees' performances, it is 
important to ascertain whether they are, in 
fact, applying the appropriate criteria and not 
being influenced by factors that are irrelevant 
to the intended interpretation. Thus, valida¬ 
tion may include empirical studies of how 
observers or judges record and evaluate data 
along with analyses of the appropriateness of 
these processes to the intended interpretation 
or construct definition. 

Evidence Based on Internal Structure 

Analyses of the internal structure of a 
test can indicate the degree to which the 
relationships among test items and test com¬ 
ponents conform to the construct on which 
the proposed test score interpretations are 
based. The conceptual framework for a test 
may imply a single dimension of behavior, 
or it may posit several components that are 
each expected to be homogeneous, but that 
are also distinct from each other. For exam¬ 
ple, a measure of discomfort on a health sur¬ 
vey might assess both physical and emotional 
health. The extent to which item interrelationships
bear out the presumptions of the
framework would be relevant to validity. 

The specific types of analysis and their 
interpretation depend on how the test will 
be used. For example, if a particular appli¬ 
cation posited a series of test components of 
increasing difficulty, empirical evidence of 
the extent to which response patterns con¬ 
formed to this expectation would be provid¬ 
ed. A theory that posited unidimensionality 
would call for evidence of item homogeneity.
In this case, the item interrelationships
also provide an estimate of score reliability,
but such an index would be inappropriate for 
tests with a more complex internal structure. 
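
As an illustration of how item interrelationships can yield a reliability estimate under a unidimensional framework, the following minimal sketch computes coefficient alpha. The choice of coefficient alpha is an assumption made here for illustration; the text does not name a specific index. The function name and the data are hypothetical.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Coefficient alpha for a matrix of scores.

    item_scores: list of rows, one row per examinee, one column per item.
    Assumes at least two items and a nonzero total-score variance.
    """
    k = len(item_scores[0])
    # Variance of each item across examinees.
    item_vars = [pvariance([row[j] for row in item_scores]) for j in range(k)]
    # Variance of the total score across examinees.
    total_var = pvariance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 5 examinees, 4 items scored 0 or 1.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 3))
```

An index of this kind summarizes homogeneity across all items at once, which is why it would not suit a test whose framework posits several distinct components.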

Some studies of the internal structure of 
tests are designed to show whether particular 
items may function differently for identifiable 
subgroups of examinees. Differential item 
functioning occurs when different groups 
of examinees with similar overall ability, or 
similar status on an appropriate criterion, 
have, on average, systematically different 
responses to a particular item. This issue is 
discussed in chapters 3 and 7. However, differential
item functioning is not always a
flaw or weakness. Subsets of items that have 
a specific characteristic in common (e.g., 
specific content, task representation) may 
function differently for different groups of 
similarly scoring examinees. This indicates 
a kind of multidimensionality that may be 
unexpected or may conform to the test 
framework. 
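
The comparison of groups "with similar overall ability" can be sketched by matching examinees on total score and comparing item performance within each matched stratum. The example below uses the Mantel-Haenszel common odds ratio, which is one widely used statistic for this purpose and is an assumption here, not a method prescribed by the text. The group labels and counts are hypothetical.

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across matched-ability strata.

    strata: iterable of tuples
        (ref_correct, ref_wrong, focal_correct, focal_wrong),
    one tuple per stratum of examinees with similar total scores.
    A value near 1.0 suggests the item behaves similarly for both groups
    once ability is matched. Raises ZeroDivisionError if the denominator
    sums to zero.
    """
    numerator = 0.0
    denominator = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty strata
        numerator += a * d / n
        denominator += b * c / n
    return numerator / denominator


# Hypothetical counts for two score strata.
similar_item = [(30, 20, 30, 20), (40, 10, 40, 10)]
flagged_item = [(30, 20, 20, 30), (40, 10, 30, 20)]

or_similar = mantel_haenszel_odds_ratio(similar_item)
or_flagged = mantel_haenszel_odds_ratio(flagged_item)
print(or_similar, round(or_flagged, 3))
```

A ratio well away from 1.0 only flags an item for review; as noted above, differential functioning is not always a flaw.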

Evidence Based on Relations to Other Variables 

Analyses of the relationship of test scores 
to variables external to the test provide anoth¬ 
er important source of validity evidence. 
External variables may include measures of 
some criteria that the test is expected to predict,
as well as relationships to other tests
hypothesized to measure the same constructs, 
and tests measuring related or different con¬ 
structs. Measures other than test scores, such 
as performance criteria, are often used in
employment settings. Categorical variables, 
including group membership variables, 
become relevant when the theory underlying
a proposed test use suggests that group differ¬ 
ences should be present or absent if a pro¬ 
posed test interpretation is to be supported. 
Evidence based on relationships with other 
variables addresses questions about the degree 
to which these relationships are consistent
with the construct underlying the proposed 
test interpretations. 
Convergent and discriminant evidence.

Relationships between test scores and other 
measures intended to assess similar constructs 
provide convergent evidence, whereas rela¬ 
tionships between test scores and measures 
purportedly of different constructs provide 
discriminant evidence. For instance, within 
some theoretical frameworks, scores on a 
multiple-choice test of reading comprehen¬ 
sion might be expected to relate closely 
(convergent evidence) to other measures of 
reading comprehension based on other meth¬ 
ods, such as essay responses; conversely, test 
scores might be expected to relate less closely 
(discriminant evidence) to measures of other 
skills, such as logical reasoning. Relationships 
among different methods of measuring the 
construct can be especially helpful in sharp¬ 
ening and elaborating score meaning and 
interpretation. 
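
The reading comprehension example can be sketched numerically by correlating the multiple-choice scores with an essay-based measure of the same construct and with a measure of a different skill. The following is a minimal illustration using the Pearson correlation; all variable names and scores are hypothetical, and a real study would need far larger samples.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six examinees.
mc_reading = [12, 15, 9, 20, 17, 11]    # multiple-choice reading test
essay_reading = [3, 4, 2, 5, 5, 3]      # essay-based reading measure
logic = [7, 6, 8, 8, 7, 6]              # logical reasoning measure

r_convergent = pearson_r(mc_reading, essay_reading)
r_discriminant = pearson_r(mc_reading, logic)
print(round(r_convergent, 2), round(r_discriminant, 2))
```

In this constructed example the first correlation is high and the second is low, which is the pattern the theoretical framework described above would lead one to expect.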

Evidence of relations with other variables 
can involve experimental as well as correla¬ 
tional evidence. Studies might be designed, 
for instance, to investigate whether scores on 
a measure of anxiety improve as a result of 
some psychological treatment or whether 
scores on a test of academic achievement dif¬ 
ferentiate between instructed and nonin- 
structed groups. If performance increases due 
to short-term coaching are viewed as a threat
to validity, it would be useful to investigate
whether coached and uncoached groups perform
differently.

Test-criterion relationships. Evidence of 
the relation of test scores to a relevant criterion 
may be expressed in various ways, but the 
fundamental question is always: How accu¬ 
rately do test scores predict criterion per¬ 
formance? The degree of accuracy deemed 
necessary depends on the purpose for which 
the test is used. 

The criterion variable is a measure of some 
attribute or outcome that is of primary inter¬ 
est, as determined by test users, who may be 
administrators in a school system, the management
of a firm, or clients. The choice of
the criterion and the measurement procedures
used to obtain criterion scores are of central 
importance. The value of a test-criterion study
depends on the relevance, reliability, and validity 
of the interpretation based on the criterion 
measure for a given testing application. 

Historically, two designs, often called 
predictive and concurrent, have been distin¬ 
guished for evaluating test-criterion relation¬ 
ships. A predictive study indicates how 
accurately test data can predict criterion scores 
that are obtained at a later time. A concurrent 
study obtains predictor and criterion infor¬ 
mation at about the same time. When predic¬ 
tion is actually contemplated, as in education 
or employment settings, or in planning rehabilitation
regimens, predictive studies can
retain the temporal differences and other 
characteristics of the practical situation. 
Concurrent evidence, which avoids temporal 
changes, is particularly useful for psychodiagnostic
tests or to investigate alternative measures
of some specified construct. In general,
the choice of research strategy is guided by 
prior evidence of the extent to which predic¬ 
tive and concurrent studies yield the same or 
different results in the domain. 

Test scores are sometimes used in allocat¬ 
ing individuals to different treatments, such as 
different jobs within an institution, in a way 
that is advantageous for the institution and for 
the individuals. In that context, evidence is 
needed to judge the suitability of using a test 
when classifying or assigning a person to one 
job versus another or to one treatment versus 
another. Classification decisions are supported
by evidence that the relationship of test scores 
to performance criteria is different for different 
treatments. It is possible for tests to be highly 
predictive of performance for different educa¬ 
tion programs or jobs without providing the 
information necessary to make a comparative 
judgment of the efficacy of assignments or 
treatments. In general, decision rules for 
selection or placement are also influenced by 
the number of persons to be accepted or the 
numbers that can be accommodated in alternative
placement categories.

Evidence about relations to other vari¬ 
ables is also used to investigate questions of 
differential prediction for groups. For instance, 
a finding that the relation of test scores to a 
relevant criterion variable differs from one 
group to another may imply that the mean¬ 
ing of the scores is not the same for members 
of the different groups, perhaps due to con¬ 
struct underrepresentation or construct-irrele¬ 
vant components. However, the difference 
may also imply that the criterion has different 
meaning for different groups. The differences 
in test-criterion relationships can also arise 
from measurement error, especially when 
group means differ, so such differences do 
not necessarily indicate differences in score 
meaning. (See chapter 7.) 
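
A first descriptive look at differential prediction can be sketched by fitting a separate least-squares line of criterion on test score for each group and comparing slopes and intercepts. This is only an illustrative assumption about how such a comparison might begin; the group labels and data are hypothetical, and the sketch does not address measurement error or statistical significance, both of which matter for the reasons given above.

```python
def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y predicted from x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Hypothetical test scores and criterion values for two groups.
group_a_test = [10, 20, 30, 40]
group_a_criterion = [2.0, 3.0, 4.0, 5.0]
group_b_test = [10, 20, 30, 40]
group_b_criterion = [2.5, 3.0, 3.5, 4.0]

slope_a, intercept_a = fit_line(group_a_test, group_a_criterion)
slope_b, intercept_b = fit_line(group_b_test, group_b_criterion)
print(slope_a, intercept_a, slope_b, intercept_b)
```

Differing slopes or intercepts in such a comparison are a prompt for further inquiry rather than a conclusion about score meaning.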

Validity generalization. An important 
issue in educational and employment settings 
is the degree to which evidence of validity 
based on test-criterion relations can be generalized
to a new situation without further study
of validity in that new situation. When a test 
is used to predict the same or similar criteria 
(e.g., performance of a given job) at different 
times or in different places, it is typically found 
that observed test-criterion correlations vary 
substantially. In the past, this has been taken 
to imply that local validation studies are always 
required. More recently, meta-analytic analyses
have shown that in some domains, much of
this variability may be due to statistical artifacts
such as sampling fluctuations and variations 
across validation studies in the ranges of test 
scores and in the reliability of criterion meas¬ 
ures. When these and other influences are taken 
into account, it may be found that the remain¬ 
ing variability in validity coefficients is relatively 
small. Thus, statistical summaries of past vali¬ 
dation studies in similar situations may be 
useful in estimating test-criterion relationships
in a new situation. This practice is referred to 
as the study of validity generalization. 
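
The role of sampling fluctuation described above can be sketched with a "bare-bones" meta-analytic summary: a sample-size-weighted mean correlation, the observed variance of the correlations, and the variance expected from sampling error alone. The formulas below follow one commonly used approach and are an assumption of this sketch, not a procedure taken from the text; corrections for range restriction and criterion unreliability are deliberately omitted. The study data are hypothetical.

```python
def bare_bones_summary(studies):
    """Summarize validity coefficients across studies.

    studies: list of (sample_size, observed_correlation) pairs.
    Returns (weighted mean r, observed variance, expected sampling-error
    variance, residual variance floored at zero).
    """
    total_n = sum(n for n, _ in studies)
    mean_r = sum(n * r for n, r in studies) / total_n
    observed_var = sum(n * (r - mean_r) ** 2 for n, r in studies) / total_n
    # Approximate sampling-error variance at the average sample size.
    average_n = total_n / len(studies)
    sampling_var = (1 - mean_r ** 2) ** 2 / (average_n - 1)
    residual_var = max(0.0, observed_var - sampling_var)
    return mean_r, observed_var, sampling_var, residual_var


# Hypothetical validation studies: (sample size, test-criterion correlation).
studies = [(50, 0.20), (100, 0.30), (150, 0.35), (200, 0.25)]
mean_r, observed_var, sampling_var, residual_var = bare_bones_summary(studies)
print(round(mean_r, 3), round(observed_var, 6), round(sampling_var, 6), residual_var)
```

In this constructed example the observed variance is smaller than the variance expected from sampling error, so no residual variability remains to be explained by differences among situations.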


In some circumstances, there is a strong 
basis for using validity generalization. This 
would be the case where the meta-analytic 
database is large, where the meta-analytic data 
adequately represent the type of situation to 
which one wishes to generalize, and where 
correction for statistical artifacts produces a 
clear and consistent pattern of validity evidence.
In such circumstances, the informational
value of a local validity study may be
relatively limited. In other circumstances, the 
inferential leap required for generalization 
may be much larger. The meta-analytic data¬ 
base may be small, the findings may be less 
consistent, or the new situation may involve 
features markedly different from those repre¬ 
sented in the meta-analytic database. In such 
circumstances, situation-specific evidence of 
validity will be relatively more informative. 
Although research on validity generalization 
shows that results of a single local validation 
study may be quite imprecise, there are situations
where a single study, carefully done,
with adequate sample size, provides sufficient 
evidence to support test use in a new situa¬ 
tion. This highlights the importance of exam¬ 
ining carefully the comparative informational 
value of local versus meta-analytic studies. 

In conducting studies of the generaliz- 
ability of validity evidence, the prior studies 
that are included may vary according to sev¬ 
eral situational facets. Some of the major 
facets are (a) differences in the way the pre¬ 
dictor construct is measured, (b) the type of 
job or curriculum involved, (c) the type of 
criterion measure used, (d) the type of test 
takers, and (e) the time period in which the 
study was conducted. In any particular study
of validity generalization, any number of these 
facets might vary, and a major objective of the 
study is to determine empirically the extent 
to which variation in these facets affects the 
test-criterion correlations obtained. 

The extent to which predictive or con¬ 
current evidence of validity generalization can 
be used in new situations is in large measure
a function of accumulated research. Although
evidence of generalization can often help to
support a claim of validity in a new situation,
the extent of available data limits the extent to
which the claim can be sustained.

The above discussion focuses on the use 
of cumulative databases to estimate predictor- 
criterion relationships. Meta-analytic tech¬ 
niques can also be used to summarize other 
forms of data relevant to other inferences one 
may wish to draw from test scores in a partic¬ 
ular application, such as effects of coaching 
and effects of certain alterations in testing 
conditions to accommodate test takers with 
certain disabilities. 

Evidence Based on Consequences of Testing 

An issue receiving attention in recent 
years is the incorporation of the intended and 
unintended consequences of test use into the 
concept of validity. Evidence about consequences
can inform validity decisions. Here,
however, it is important to distinguish 
between evidence that is directly relevant to
validity and evidence that may inform deci¬ 
sions about social policy but falls outside 
the realm of validity. 

Distinguishing between issues of validity 
and issues of social policy becomes particularly 
important in cases where differential conse¬ 
quences of test use are observed for different 
identifiable groups. For example, concerns 
have been raised about the effect of group
differences in test scores on employment 
selection and promotion, the placement of 
children in special education classes, and the 
narrowing of a school’s curriculum to exclude 
learning of objectives that are not assessed. 
Although information about the consequences 
of testing may influence decisions about test 
use, such consequences do not in and of 
themselves detract from the validity of intended 
test interpretations. Rather, judgments of 
validity or invalidity in the light of testing
consequences depend on a more searching
inquiry into the sources of those consequences. 

Take, as an example, a finding of different 
hiring rates for members of different groups as 
a consequence of using an employment test. If 
the difference is due solely to an unequal distribution
of the skills the test purports to measure,
and if those skills are, in fact, important
contributors to job performance, then the finding
of group differences per se does not imply
any lack of validity for the intended inference.
If, however, the test measured skill differences
unrelated to job performance (e.g., a sophisticated
reading test for a job that required only
minimal functional literacy), or if the differences
were due to the test's sensitivity to some
examinee characteristic not intended to be part 
of the test construct, then validity would be 
called into question, even if test scores correlat¬ 
ed positively with some measure of job per¬ 
formance. Thus, evidence about consequences 
may be directly relevant to validity when it can 
be traced to a source of invalidity such as con¬ 
struct underrepresentation or construct-irrele¬ 
vant components. Evidence about consequences 
that cannot be so traced—that in fact reflects 
valid differences in performance—is crucial in 
informing policy decisions but falls outside the
technical purview of validity. 

Tests are commonly administered in the 
expectation that some benefit will be realized 
from the intended use of the scores. A few of 
the many possible benefits are selection of 
efficacious treatments for therapy, placement 
of workers in suitable jobs, prevention of 
unqualified individuals from entering a pro¬ 
fession, or improvement of classroom instruc¬ 
tional practices. A fundamental purpose of 
validation is to indicate whether these specific 
benefits are likely to be realized. Thus, in the 
case of a test used in placement decisions, the 
validation would be informed by evidence 
that alternative placements, in fact, are dif¬ 
ferentially beneficial to the persons and the 
institution. In the case of employment testing, 
if a test publisher claims that use of the test
will result in reduced employee training costs, 
improved workforce efficiency, or some other 
benefit, then the validation would be informed 
by evidence in support of that claim. 

Claims are sometimes made for benefits 
of testing that go beyond direct uses of the 
test scores themselves. Educational tests, for
example, may be advocated on the grounds 
that their use will improve student motiva¬ 
tion or encourage changes in classroom 
instructional practices by holding educators 
accountable for valued learning outcomes. 
Where such claims are central to the rationale 
advanced for testing, the direct examination
of testing consequences necessarily assumes 
even greater importance. The validation 
process in such cases would be informed by 
evidence that the anticipated benefits of test¬ 
ing are being realized. 

Integrating the Validity Evidence 

A sound validity argument integrates various 
strands of evidence into a coherent account 
of the degree to which existing evidence and 
theory support the intended interpretation of 
test scores for specific uses. It encompasses 
evidence gathered from new studies and evi¬ 
dence available from earlier reported research. 
The validity argument may indicate the need
for refining the definition of the construct, may 
suggest revisions in the test or other aspects 
of the testing process, and may indicate areas 
needing further study. 

Ultimately, the validity of an intended 
interpretation of test scores relies on all the 
available evidence relevant to the technical 
quality of a testing system. This includes evi¬ 
dence of careful test construction; adequate 
score reliability; appropriate test administration 
and scoring; accurate score scaling, equating, 
and standard setting; and careful attention to 
fairness for all examinees, as described in subsequent
chapters of the Standards.


Standard 1.1 

A rationale should be presented for each rec¬ 
ommended interpretation and use of test 
scores, together with a comprehensive sum¬ 
mary of the evidence and theory bearing on 
the intended use or interpretation. 

Comment: The rationale should indicate what 
propositions are necessary to investigate the 
intended interpretation. The comprehensive 
summary should combine logical analysis 
with empirical evidence to provide support 
for the test rationale. Evidence may come 
from studies conducted locally, in the setting 
where the test is to be used; from specific 
prior studies; or from comprehensive statisti¬ 
cal syntheses of available studies meeting 
clearly specified criteria. No type of evidence
is inherently preferable to others; rather, the
quality and relevance of the evidence to the
intended test use determine the value of a 
particular kind of evidence. A presentation of 
empirical evidence on any point should give 
due weight to all relevant findings in the scientific
literature, including those inconsistent
with the intended interpretation or use. Test
developers have the responsibility to provide 
support for their own recommendations, but 
test users are responsible for evaluating the 
quality of the validity evidence provided and 
its relevance to the local situation. 

Standard 1.2 

The test developer should set forth clearly 
how test scores are intended to be interpret¬ 
ed and used. The population(s) for which a 
test is appropriate should be clearly delimit¬ 
ed, and the construct that the test is intend¬ 
ed to assess should be clearly described. 

Comment: Statements about validity should 
refer to particular interpretations and uses. It
is incorrect to use the unqualified phrase “the
validity of the test.” No test is valid for all
purposes or in all situations. Each recommended
use or interpretation requires validation
and should specify in clear language the
population for which the test is intended, the
construct it is intended to measure, and the
manner and contexts in which test scores are 
to be employed. 

Standard 1.3 

If validity for some common or likely inter¬ 
pretation has not been investigated, or if the 
interpretation is inconsistent with available 
evidence, that fact should be made clear and 
potential users should be cautioned about 
making unsupported interpretations. 

Comment: If past experience suggests that a 
test is likely to be used inappropriately for 
certain kinds of decisions, specific warnings 
against such uses should be given. On the 
other hand, no two situations are ever identi¬ 
cal, so some generalization by the user is 
always necessary. Professional judgment is 
required to evaluate the extent to which existing
validity evidence supports a given test use.

Standard 1.4 

If a test is used in a way that has not been 
validated, it is incumbent on the user to jus¬ 
tify the new use, collecting new evidence if 
necessary. 

Comment: Professional judgment is required to
evaluate the extent to which existing validity 
evidence applies in the new situation and to 
determine what new evidence may be needed. 
The amount and kinds of new evidence 
required may be influenced by experience with 
similar prior test uses or interpretations and 
by the amount, quality, and relevance of 
existing data. 

Standard 1.5 

The composition of any sample of examinees
from which validity evidence is obtained
should be described in as much
detail as is practical, including major rele¬ 
vant sociodemographic and developmental 
characteristics. 

Comment: Statistical findings can be influ¬ 
enced by factors affecting the sample on 
which the results are based. When the sample
is intended to represent a population, that
population should be described, and attention
should be drawn to any systematic factors
that may limit the representativeness of
the sample. Factors that might reasonably be
expected to affect the results include self-selection,
attrition, linguistic prowess, disability
status, and exclusion criteria, among others.

If the subjects of a validity study are patients, 
for example, then the diagnoses of the 
patients are important, as well as other char¬ 
acteristics, such as the severity of the diag¬ 
nosed condition. For tests used in industry, 
the employment status (e.g., applicants versus 
current job holders), the general level of experience
and educational background, and the
gender and ethnic composition of the sample 
may be relevant information. For tests used 
in educational settings, relevant information 
may include educational background, devel¬ 
opmental level, community characteristics, or 
school admissions policies, as well as the gen¬ 
der and ethnic composition of the sample. 
Sometimes restrictions about privacy preclude 
obtaining such population information. 

Standard 1.6 

When the validation rests in part on the 
appropriateness of test content, the procedures 
followed in specifying and generating test con¬ 
tent should be described and justified in refer¬ 
ence to the construct the test is intended to 
measure or the domain it is intended to repre¬ 
sent. If the definition of the content sampled 
incorporates criteria such as importance, fre¬ 
quency, or criticality, these criteria should also 
be clearly explained and justified. 

Comment: For example, test developers might
provide a logical structure that maps the
items on the test to the content domain, 
illustrating the relevance of each item and the 
adequacy with which the set of items repre¬ 
sents the content domain. Areas of the content 
domain that are not included among the test 
items could be indicated as well. 

Standard 1.7 

When a validation rests in part on the opin¬ 
ions or decisions of expert judges, observers, 
or raters, procedures for selecting such 
experts and for eliciting judgments or ratings
should be fully described. The qualifications
and experience of the judges should
be presented. The description of procedures 
should include any training and instructions 
provided, should indicate whether partici¬ 
pants reached their decisions independently, 
and should report the level of agreement 
reached. If participants interacted with one 
another or exchanged information, the pro¬ 
cedures through which they may have influ¬ 
enced one another should be set forth. 

Comment: Systematic collection of judgments 
or opinions may occur at many points in test
construction (e.g., in eliciting expert judg¬ 
ments of content appropriateness or adequate 
content representation), in formulating rules 
or standards for score interpretation (e.g., in 
setting cut scores), or in test scoring (e.g., rating
of essay responses). Whenever such procedures
are employed, the quality of the resulting
judgments is important to the validation. It 
may be entirely appropriate to have experts 
work together to reach consensus, but it would 
not then be appropriate to treat their respective 
judgments as statistically independent. 
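
One way the level of agreement between two raters might be reported is sketched below. This is an illustration only: the function name `cohen_kappa`, the pass/fail labels, and the ratings are hypothetical, and the Standards do not prescribe any particular index. The sketch computes simple percent agreement together with Cohen's kappa, which discounts the agreement expected by chance.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Return (percent agreement, Cohen's kappa) for two raters' labels."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    count_a = Counter(ratings_a)
    count_b = Counter(ratings_b)
    # Chance agreement: product of the two raters' marginal proportions per label.
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        return observed, 1.0  # both raters used one identical label throughout
    return observed, (observed - expected) / (1.0 - expected)


# Hypothetical ratings of ten essays by two raters working independently.
rater_1 = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "fail", "pass", "pass"]
rater_2 = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "fail", "pass", "pass"]
agreement, kappa = cohen_kappa(rater_1, rater_2)
print(f"agreement={agreement:.2f}, kappa={kappa:.2f}")
```

An index of this kind presumes that the raters judged independently, which is why judgments reached by consensus should not be summarized this way.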

Standard 1.8 

If the rationale for a test use or score inter¬ 
pretation depends on premises about the 
psychological processes or cognitive operations
used by examinees, then theoretical or
empirical evidence in support of those prem¬ 
ises should be provided. When statements 
about the processes employed by observers 
or scorers are part of the argument for valid¬ 
ity, similar information should be provided. 

Comment: If the test specification delineates 
the processes to be assessed, then evidence is 
needed that the test items do, in fact, tap the
intended processes. 

Standard 1.9 

If a test is claimed to be essentially unaffect¬ 
ed by practice and coaching, then the sensi¬ 
tivity of test performance to change with 
these forms of instruction should be docu¬ 
mented. 

Comment: Materials to aid in score interpreta¬ 
tion should summarize evidence indicating 
the degree to which improvement with practice
or coaching can be expected. Also, materials
written for test takers should provide
practical guidance about the value of test 
preparation activities, including coaching. 

Standard 1.10 

When interpretation of performance on spe¬ 
cific items, or small subsets of items, is sug¬ 
gested, the rationale and relevant evidence in 
support of such interpretation should be 
provided. When interpretation of individual 
item responses is likely but is not recom¬ 
mended by the developer, the user should be 
warned against making such interpretations. 

Comment: Users should be given sufficient 
guidance to enable them to judge the degree 
of confidence warranted for any use or interpretation
recommended by the test developer.
Test manuals and score reports should dis¬ 
courage overinterpretation of information 
that may be subject to considerable error. 
This is especially important if interpretation 
of performance on isolated items, small subsets
of items, or subtest scores is suggested.

Standard 1.11 

If the rationale for a test use or interpreta¬ 
tion depends on premises about the relation¬ 
ships among parts of the test, evidence 
concerning the internal structure of the test 
should be provided. 

Comment: It might be claimed, for example, 
that a test is essentially unidimensional. 
Such a claim could be supported by a mul¬ 
tivariate statistical analysis, such as a factor 
analysis, showing that the score variability 
attributable to one major dimension was 
much greater than the score variability 
attributable to any other identified dimension.
When a test provides more than one
score, the interrelationships of those scores 
should be shown to be consistent with the 
construct(s) being assessed. 
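
The comparison of score variability across dimensions can be illustrated with a rough screening sketch. It is not a factor analysis: it only extracts the two largest eigenvalues of a correlation matrix by power iteration with deflation, so that the share of variance tied to the first dimension can be compared with the second. The item correlation matrix and all function names are hypothetical, and an operational analysis would rely on an established statistical package.

```python
def _mat_vec(matrix, vec):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(row[j] * vec[j] for j in range(len(vec))) for row in matrix]


def dominant_eigenpair(matrix, iterations=200):
    """Power iteration for a symmetric positive semidefinite matrix."""
    vec = [1.0 + 0.1 * i for i in range(len(matrix))]  # non-uniform start vector
    for _ in range(iterations):
        product = _mat_vec(matrix, vec)
        norm = sum(v * v for v in product) ** 0.5
        vec = [v / norm for v in product]
    # Rayleigh quotient of the normalized vector gives the eigenvalue.
    eigenvalue = sum(v * p for v, p in zip(vec, _mat_vec(matrix, vec)))
    return eigenvalue, vec


def first_two_eigenvalues(corr):
    """Largest and second-largest eigenvalues of a correlation matrix."""
    first, vec = dominant_eigenpair(corr)
    n = len(corr)
    # Deflation: subtract the first dimension, then repeat the search.
    deflated = [[corr[i][j] - first * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    second, _ = dominant_eigenpair(deflated)
    return first, second


# Hypothetical correlations among four items that all intercorrelate 0.6.
item_corr = [
    [1.0, 0.6, 0.6, 0.6],
    [0.6, 1.0, 0.6, 0.6],
    [0.6, 0.6, 1.0, 0.6],
    [0.6, 0.6, 0.6, 1.0],
]
first, second = first_two_eigenvalues(item_corr)
share_first = first / len(item_corr)  # proportion of total variance
```

A first eigenvalue that dwarfs the second is consistent with, but does not by itself establish, a claim of essential unidimensionality.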

Standard 1.12 

When interpretation of subscores, score dif¬ 
ferences, or profiles is suggested, the ration¬ 
ale and relevant evidence in support of such 
interpretation should be provided. Where 
composite scores are developed, the basis 
and rationale for arriving at the composites 
should be given. 

Comment: When a test provides more than 
one score, the distinctiveness of the separate 
scores should be demonstrated, and the inter¬ 
relationships of those scores should be shown 
to be consistent with the construct(s) being 
assessed. Moreover, evidence for the validity 
of interpretations of two separate scores would 
not necessarily justify an interpretation of the 
difference between them. Rather, the rationale 
and supporting evidence must pertain directly 
to the specific score or score combination to 
be interpreted or used. 
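
The caution about difference scores can be made concrete with a standard result from classical test theory. The sketch below assumes two scores with equal variances and uncorrelated measurement errors; under those assumptions the reliability of the difference is ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy). The function name and the numerical values are hypothetical.

```python
def difference_score_reliability(rel_x, rel_y, corr_xy):
    """Reliability of X - Y, assuming equal variances and uncorrelated errors."""
    if corr_xy >= 1.0:
        raise ValueError("the correlation between the scores must be below 1")
    return ((rel_x + rel_y) / 2.0 - corr_xy) / (1.0 - corr_xy)


# Two hypothetical subscores, each with reliability 0.80, correlating 0.60.
rel_diff = difference_score_reliability(0.80, 0.80, 0.60)
print(round(rel_diff, 2))  # 0.5
```

Two individually dependable scores can therefore yield a much less dependable difference, especially when the scores are highly correlated.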


Standard 1.13 

When validity evidence includes statistical 
analyses of test results, either alone or 
together with data on other variables, the 
conditions under which the data were col¬ 
lected should be described in enough detail 
that users can judge the relevance of the 
statistical findings to local conditions.
Attention should be drawn to any features 
of a validation data collection that are likely 
to differ from typical operational testing 
conditions and that could plausibly influ¬ 
ence test performance. 

Comment: Such conditions might include 
(but would not be limited to) the following:
examinee motivation or prior preparation, the
distribution of test scores over examinees, the
time allowed for examinees to respond or
other administrative conditions, examiner 
training or other examiner characteristics, 
the time intervals separating collection of 
data on different measures, or conditions 
that may have changed since the validity 
evidence was obtained. 

Standard 1.14 

When validity evidence includes empirical 
analyses of test responses together with data 
on other variables, the rationale for selecting
the additional variables should be provided. 
Where appropriate and feasible, evidence 
concerning the constructs represented by 
other variables, as well as their technical 
properties, should be presented or cited. 
Attention should be drawn to any likely 
sources of dependence (or lack of independence)
among variables other than dependencies
among the construct(s) they represent.

Comment: The patterns of association 
between and among scores on the instrument
under study and other variables should be 
consistent with theoretical expectations. The 
additional variables might be demographic 
characteristics, indicators of treatment conditions,
or scores on other measures. They
might include intended measures of the same
construct or of different constructs. The relia¬ 
bility of scores from such other measures and 
the validity of intended interpretations of 
scores from these measures are an important
part of the validity evidence for the instru¬ 
ment under study. If such variables include 
composite scores, the construction of the 
composites should be explained. In addition 
to considering the properties of each variable 
in isolation, it is important to guard against 
faulty interpretations arising from spurious 
sources of dependency among measures, 
including correlated errors or shared variance 
due to common methods of measurement or 
common elements. 

Standard 1.15 

When it is asserted that a certain level of 
test performance predicts adequate or 
inadequate criterion performance, informa¬ 
tion about the levels of criterion perform¬ 
ance associated with given levels of test 
scores should be provided. 

Comment: Regression equations are more useful
than correlation coefficients, which are
generally insufficient to fully describe patterns 
of association between tests and other vari¬ 
ables. Means, standard deviations, and other 
statistical summaries are needed, as well as
information about the distribution of criteri¬ 
on performances conditional upon a given 
test score. Evidence of overall association 
between variables should be supplemented by 
information about the form of that associa¬ 
tion and about the variability associated with 
that association in different ranges of test 
scores. Note that data collections employing
examinees selected for their extreme scores on
one or more measures (extreme groups) typically
cannot provide adequate information
about the association. 
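
The difference between reporting a correlation and reporting a regression equation can be sketched as follows. The example fits an ordinary least-squares line to hypothetical test and criterion scores and returns, alongside the correlation, the slope, the intercept, and the standard error of estimate, which together indicate the criterion level expected at a given test score and the spread around it. The data and the function name `fit_test_criterion_line` are assumptions for illustration; a single line also assumes a linear relationship with similar variability across the score range, which should itself be checked.

```python
import math


def fit_test_criterion_line(test_scores, criterion):
    """Least-squares line predicting the criterion from test scores."""
    n = len(test_scores)
    if n < 3 or n != len(criterion):
        raise ValueError("need at least three paired observations")
    mean_x = sum(test_scores) / n
    mean_y = sum(criterion) / n
    sxx = sum((x - mean_x) ** 2 for x in test_scores)
    syy = sum((y - mean_y) ** 2 for y in criterion)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(test_scores, criterion))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residual_ss = sum((y - (intercept + slope * x)) ** 2
                      for x, y in zip(test_scores, criterion))
    return {
        "slope": slope,
        "intercept": intercept,
        "correlation": sxy / math.sqrt(sxx * syy),
        # Typical size of prediction errors, in criterion units.
        "std_error_of_estimate": math.sqrt(residual_ss / (n - 2)),
    }


# Hypothetical test scores and supervisor ratings for five examinees.
fit = fit_test_criterion_line([10, 20, 30, 40, 50], [2, 4, 5, 4, 5])
predicted_at_35 = fit["intercept"] + fit["slope"] * 35
```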


Standard 1.16 

When validation relies on evidence that test 
scores are related to one or more criterion 
variables, information about the suitability 
and technical quality of the criteria should 
be reported. 

Comment: The description of each criterion
variable should include evidence concerning
its reliability, the extent to which it represents
the intended construct (e.g., job performance),
and the extent to which it is likely to be 
influenced by extraneous sources of variance. 
Special attention should be given to sources 
that previous research suggests may introduce
extraneous variance that might bias the criterion
for or against identifiable groups.

Standard 1.17 

If test scores are used in conjunction with 
other quantifiable variables to predict some 
outcome or criterion, regression (or equiva¬ 
lent) analyses should include those additional 
relevant variables along with the test scores. 

Comment: In general, if several predictors of 
some criterion are available, the optimum 
combination of predictors cannot be deter¬ 
mined solely from separate, pairwise examina¬ 
tions of the criterion variable with each 
separate predictor in turn. It is often informa¬ 
tive to estimate the increment in predictive 
accuracy that may be expected when each 
variable, including the test score, is intro¬ 
duced in addition to all other available vari¬ 
ables. Analyses involving multiple predictors 
should be verified by cross-validation or 
equivalent analysis whenever feasible, and the 
precision of estimated regression coefficients 
should be reported. 
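
The increment in predictive accuracy can be illustrated for the simplest case of one existing predictor and one test score. The sketch below uses the standard formula for the squared multiple correlation with two predictors, computed from the three pairwise correlations. The function names and the correlations are hypothetical, and the result is an in-sample figure: it does not replace the cross-validation called for above.

```python
def two_predictor_r_squared(r_y1, r_y2, r_12):
    """Squared multiple correlation of a criterion with two predictors."""
    if abs(r_12) >= 1.0:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2.0 * r_y1 * r_y2 * r_12) / (1.0 - r_12 ** 2)


def incremental_r_squared(r_y_existing, r_y_test, r_between):
    """Gain in R-squared when the test score is added to an existing predictor."""
    combined = two_predictor_r_squared(r_y_existing, r_y_test, r_between)
    return combined - r_y_existing ** 2


# Hypothetical values: existing predictor correlates 0.50 with the criterion,
# the test correlates 0.40, and the two predictors correlate 0.30.
gain = incremental_r_squared(0.50, 0.40, 0.30)
```

With these values the test adds roughly 0.07 to R-squared, noticeably less than its own squared correlation of 0.16, because part of what it predicts is already captured by the existing predictor.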

Standard 1.18 

When statistical adjustments, such as those 
for restriction of range or attenuation, are 
made, both adjusted and unadjusted coefficients,
as well as the specific procedure used,
and all statistics used in the adjustment, 
should be reported. 

Comment: The correlation between two vari¬ 
ables, such as test scores and criterion meas¬ 
ures, depends on the range of values on each 
variable. For example, the test scores and the 
criterion values of selected applicants will typi¬ 
cally have a smaller range than the scores of 
all applicants. Statistical methods are available 
for adjusting the correlation to reflect the 
population of interest rather than the sample 
available. Such adjustments are often appro¬ 
priate, as when comparing results across 
various situations. Reporting an adjusted 
correlation should be accompanied by a state¬ 
ment of the method and the statistics used in 
making the adjustment. 
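
A minimal sketch of one such adjustment follows. It applies the commonly used correction for direct range restriction on the predictor (often called Thorndike's Case II) and returns the unadjusted coefficient, the adjusted coefficient, the method, and the standard deviations used, so that all of them can be reported together. The function names and the numerical values are hypothetical, and other restriction mechanisms would call for different formulas.

```python
import math


def correct_for_range_restriction(restricted_r, sd_unrestricted, sd_restricted):
    """Adjust a correlation for direct range restriction on the predictor."""
    u = sd_unrestricted / sd_restricted
    return restricted_r * u / math.sqrt(1.0 + restricted_r ** 2 * (u ** 2 - 1.0))


def adjustment_report(restricted_r, sd_unrestricted, sd_restricted):
    """Bundle the coefficients with the method and statistics used."""
    return {
        "method": "direct range restriction on the predictor (Thorndike Case II)",
        "unadjusted_r": restricted_r,
        "adjusted_r": correct_for_range_restriction(
            restricted_r, sd_unrestricted, sd_restricted),
        "sd_unrestricted": sd_unrestricted,
        "sd_restricted": sd_restricted,
    }


# Hypothetical case: r = 0.30 among selected applicants, whose test-score
# standard deviation (5.0) is half that of the full applicant pool (10.0).
report = adjustment_report(0.30, sd_unrestricted=10.0, sd_restricted=5.0)
```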

Standard 1.19 

If a test is recommended for use in assigning 
persons to alternative treatments or is likely 
to be so used, and if outcomes from those 
treatments can reasonably be compared on a 
common criterion, then, whenever feasible, 
supporting evidence of differential outcomes 
should be provided. 

Comment: If a test is used for classification 
into alternative occupational, therapeutic, or 
educational programs, it is not sufficient just
to show that the test predicts treatment out¬ 
comes. Support for the validity of the classifi¬ 
cation procedure is provided by showing that 
the test is useful in determining which persons
are likely to profit differentially from
one treatment or another. Treatment cate¬ 
gories may have to be combined to assemble 
sufficient cases for statistical analysis. It is rec¬ 
ognized, however, that such research may not 
be feasible, because ethical and legal con¬ 
straints on differential assignments may for¬ 
bid control groups. 
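
One simple way to look for differential outcomes is sketched below, under the assumption that outcomes from two treatments are measured on a common criterion. A separate regression line is fitted within each treatment; if the lines differ in slope and cross inside the observed score range, examinees below and above the crossover are predicted to do better under different treatments. The data and function names are hypothetical, and a real study would also need to assess the statistical precision of the difference in slopes.

```python
def fit_line(scores, outcomes):
    """Return (slope, intercept) of the least-squares line for one treatment."""
    n = len(scores)
    mean_x = sum(scores) / n
    mean_y = sum(outcomes) / n
    sxx = sum((x - mean_x) ** 2 for x in scores)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scores, outcomes))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def crossover_score(line_a, line_b):
    """Test score at which the two predicted outcomes are equal, or None."""
    slope_a, intercept_a = line_a
    slope_b, intercept_b = line_b
    if slope_a == slope_b:
        return None  # parallel lines: no differential prediction
    return (intercept_b - intercept_a) / (slope_a - slope_b)


# Hypothetical outcomes on a common criterion under two programs.
program_a = fit_line([10, 20, 30, 40], [30, 40, 50, 60])
program_b = fit_line([10, 20, 30, 40], [42, 44, 46, 48])
cut = crossover_score(program_a, program_b)
```

In this made-up example the lines cross at a score of about 25, inside the observed range of 10 to 40, which is the pattern that would support using the test for classification.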


Standard 1.20 

When a meta-analysis is used as evidence of 
the strength of a test-criterion relationship, 
the test and the criterion variables in the 
local situation should be comparable with 
those in the studies summarized. If relevant 
research includes credible evidence that any 
other features of the testing application may 
influence the strength of the test-criterion 
relationship, the correspondence between 
those features in the local situation and in 
the meta-analysis should be reported. Any 
significant disparities that might limit the 
applicability of the meta-analytic findings to 
the local situation should be noted explicitly. 

Comment: The meta-analysis should incorpo¬ 
rate all available studies meeting explicitly 
stated inclusion criteria. Meta-analytic evi¬ 
dence used in test validation typically is based 
on a number of tests measuring the same or 
very similar constructs and criterion measures 
that likewise measure the same or similar 
constructs. A meta-analytic study may also be 
limited to a single test and a single criterion. 
For each study included in the analysis, the 
test-criterion relationship is expressed in some
common metric, often as an effect size. The 
strength of the test-criterion relationship may 
be moderated by features of the situation in 
which the test and criterion measures were 
obtained (e.g., types of jobs, characteristics of 
test takers, time interval separating collection 
of test and criterion measures, year or decade 
in which the data were collected). If test-cri¬ 
terion relationships vary according to such 
moderator variables, then, the numbers of 
studies permitting, the meta-analysis should
report separate estimated effect size distribu¬ 
tions conditional upon relevant situational 
features. This might be accomplished, for 
example, by reporting separate distributions 
for subsets of studies or by estimating the 
magnitudes of the influences of situational
features on effect sizes. 
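
Reporting separate estimates for subsets of studies might look like the bare-bones sketch below, which computes a sample-size-weighted mean correlation within each level of a moderator. The study records, the moderator labels, and the function name are hypothetical. A fuller meta-analysis would also describe the spread of effect sizes within each subset, not only the mean.

```python
from collections import defaultdict


def weighted_mean_r_by_group(studies):
    """Sample-size-weighted mean correlation for each moderator level."""
    totals = defaultdict(lambda: [0.0, 0])  # level -> [sum of n * r, sum of n]
    for study in studies:
        acc = totals[study["moderator"]]
        acc[0] += study["n"] * study["r"]
        acc[1] += study["n"]
    return {level: weighted / n for level, (weighted, n) in totals.items()}


# Hypothetical validity studies coded by type of job.
studies = [
    {"moderator": "clerical", "n": 100, "r": 0.30},
    {"moderator": "clerical", "n": 300, "r": 0.20},
    {"moderator": "managerial", "n": 200, "r": 0.40},
    {"moderator": "managerial", "n": 200, "r": 0.50},
]
means = weighted_mean_r_by_group(studies)
```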




Standard 1.21 

Any meta-analytic evidence used to support 
an intended test use should be clearly 
described, including methodological choices 
in identifying and coding studies, correcting 
for artifacts, and examining potential mod¬ 
erator variables. Assumptions made in cor¬ 
recting for artifacts such as criterion 
unreliability and range restriction should be 
presented, and the consequences of these 
assumptions made clear. 

Comment: Meta-analysis inevitably involves 
judgments regarding a number of method¬ 
ological choices. The bases for these judg¬ 
ments should be articulated. In the case of 
choices involving some degree of uncertainty, 
such as artifact corrections based on assumed 
values, the uncertainty should be acknowl¬ 
edged and the degree to which conclusions 
about validity hinge on these assumptions 
should be examined and reported. 
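
The degree to which a conclusion hinges on an assumed value can be examined by repeating the correction over a range of plausible assumptions. The sketch below does this for the familiar correction for criterion unreliability, in which the observed correlation is divided by the square root of the criterion reliability. The observed correlation, the assumed reliabilities, and the function names are hypothetical.

```python
import math


def correct_for_criterion_unreliability(observed_r, criterion_reliability):
    """Disattenuate a validity coefficient for unreliability in the criterion."""
    if not 0.0 < criterion_reliability <= 1.0:
        raise ValueError("reliability must be in the interval (0, 1]")
    return observed_r / math.sqrt(criterion_reliability)


def sensitivity_to_assumed_reliability(observed_r, assumed_values):
    """Corrected coefficient under each assumed criterion reliability."""
    return {value: correct_for_criterion_unreliability(observed_r, value)
            for value in assumed_values}


# Hypothetical observed validity of 0.30 and three assumed reliabilities.
table = sensitivity_to_assumed_reliability(0.30, [0.49, 0.64, 0.81])
for assumed, corrected in table.items():
    print(f"assumed reliability {assumed:.2f} -> corrected r {corrected:.3f}")
```

If the corrected values differ enough across the plausible range to change the conclusion, that dependence is what should be acknowledged and reported.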

Standard 1.22 

When it is clearly stated or implied that a 
recommended test use will result in a specif¬ 
ic outcome, the basis for expecting that out¬ 
come should be presented, together with 
relevant evidence. 

Comment: If it is asserted, for example, that
using a given test for employee selection will 
result in reduced employee errors or training 
costs, evidence in support of that assertion 
should be provided. A given claim for the 
benefits of test use may be supported by logi¬ 
cal or theoretical argument as well as empiri¬ 
cal data. Due weight should be given to 
findings in the scientific literature that may 
be inconsistent with the stated expectation. 

Standard 1.23 

When a test use or score interpretation is 
recommended on the grounds that testing or
the testing program per se will result in
some indirect benefit in addition to the util¬ 
ity of information from the test scores them¬ 
selves, the rationale for anticipating the 
indirect benefit should be made explicit. 
Logical or theoretical arguments and empiri¬ 
cal evidence for the indirect benefit should 
be provided. Due weight should be given to 
any contradictory findings in the scientific 
literature, including findings suggesting 
important indirect outcomes other than 
those predicted. 

Comment: For example, certain educational 
testing programs have been advocated on
the grounds that they would have a salutary
influence on classroom instructional practices 
or would clarify students' understanding of 
the kind or level of achievement they were 
expected to attain. To the extent that such 
claims enter into the justification for a testing 
program, they become part of the validity 
argument for test use and so should be exam¬ 
ined as part of the validation effort. Due 
weight should be given to evidence against 
such predictions, for example, evidence that 
under some conditions educational testing 
may have a negative effect on classroom 
instruction. 

Standard 1.24

When unintended consequences result from 
test use, an attempt should be made to 
investigate whether such consequences arise 
from the test's sensitivity to characteristics 
other than those it is intended to assess or 
to the test's failure fully to represent the 
intended construct. 

Comment: The validity of test score interpre¬ 
tations may be limited by construct-irrelevant 
components or construct underrepresentation.
When unintended consequences appear to 
stem, at least in part, from the use of one or 
more tests, it is especially important to check 
that these consequences do not arise from
such sources of invalidity. Although group 
differences, in and of themselves, do not call 
into question the validity of a proposed inter¬ 
pretation, they may increase the salience of 
plausible rival hypotheses that should be 
investigated as part of the validation effort. 

2. RELIABILITY AND ERRORS OF
MEASUREMENT 


Background 

A test, broadly defined, is a set of tasks designed
to elicit or a scale to describe examinee behavior
in a specified domain, or a system for collecting
samples of an individual's work in a particular 
area. Coupled with the device is a scoring pro¬ 
cedure that enables the examiner to quantify, 
evaluate, and interpret the behavior or work 
samples. Reliability refers to the consistency 
of such measurements when the testing pro¬ 
cedure is repeated on a population of individ¬ 
uals or groups. 

The discussion that follows introduces 
concepts and procedures that may not be familiar
to some readers. It is not expected that the
brief definitions and explanations presented 
here will be sufficient to enable the less sophisticated
reader to become adequately conversant
with these developments. To achieve a
better understanding, such readers may need 
to consult more comprehensive treatments 
in the measurement literature. 

The usefulness of behavioral measure¬ 
ments presupposes that individuals and groups 
exhibit some degree of stability in their behav¬ 
ior. However, successive samples of behavior 
from the same person are rarely identical in all 
pertinent respects. An individual's perform¬ 
ances, products, and responses to sets of test 
questions vary in their quality or character 
from one occasion to another, even under 
strictly controlled conditions. This variation
is reflected in the examinee's scores. The causes
of this variability are generally unrelated to
the purposes of measurement. An examinee 
may try harder, may make luckier guesses, be 
more alert, feel less anxious, or enjoy better
health on one occasion than another. An 
examinee may have knowledge, experience, or 
understanding that is more relevant to some 
tasks than to others in the domain sampled 
by the test. Some individuals may exhibit less
variation in their scores than others, but no
examinee is completely consistent. Because of 
this variation and, in some instances, because 
of subjectivity in the scoring process, an individual's
obtained score and the average score
of a group will always reflect at least a small 
amount of measurement error. 

To say that a score includes a component
of error implies that there is a hypothetical 
error-free value that characterizes an examinee 
at the time of testing. In classical test theory 
this error-free value is referred to as the person's
true score for the test or measurement
procedure. It is conceptualized as the hypo¬ 
thetical average score resulting from many 
repetitions of the test or alternate forms of 
the instrument. In statistical terms, the true 
score is a personal parameter and each observed 
score of an examinee is presumed to estimate 
this parameter. Under an approach to reliability
estimation known as generalizability theory, a
comparable concept is referred to as an examinee's
universe score. Under item response theory
(IRT), a closely related concept is called an
examinee's ability or trait parameter, though
observed scores and trait parameters may be
stated in different units. The hypothetical difference
between an examinee's observed score
on any particular measurement and the exam¬ 
inee's true or universe score for the procedure 
is called measurement error. 
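
The decomposition described above can be written compactly. The notation is an assumption introduced here for illustration, since the text defines these quantities only in words: $X_{pj}$ is the observed score of person $p$ on replication $j$, $T_p$ is that person's true score, and $E_{pj}$ is the measurement error.

```latex
X_{pj} = T_p + E_{pj}, \qquad
T_p = \mathbb{E}_j\left[ X_{pj} \right], \qquad
\mathbb{E}_j\left[ E_{pj} \right] = 0
```

Because the true score is defined as the expected observed score over replications, a person's errors average to zero by construction.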

The definition of what constitutes a 
standardized test or measurement procedure 
has broadened significantly in recent years. At 
one time the cardinal features of most stan¬ 
dardized tests were consistency of the test 
materials from examinee to examinee, close 
adherence to stipulated procedures for test 
administration, and use of prescribed scoring 
rules that could be applied with a high degree 
of consistency. These features were, in fact, 
what made a test “standardized,” and they 
made meaningful norms possible. In employment
settings and certification programs, flexible
measurement procedures have been in
use for many years. Individualized oral exami¬ 
nations, simulations, analyses of extended 
case reports, and performance in real-life set¬ 
tings such as clinics are now commonplace. 

In education, however, large-scale testing pro¬ 
grams with a high degree of flexibility in test 
format and administrative procedures are a 
relatively recent development. In some pro¬ 
grams cumulative portfolios of student work 
have been substituted for more traditional 
end-of-year tests of achievement. Other pro¬ 
grams now allow examinees to choose their 
own topics to demonstrate their abilities. Still 
others permit or encourage small groups of 
examinees to work cooperatively in completing
the test. A science examination, for example,
might involve a team of high school
students who conduct a study of the sources 
of pollution in local streams and prepare a 
report on their findings. Examinations of 
this kind raise complex issues regarding the
domain represented by the test and about
the generalizability of individual and group
scores. Each step toward greater flexibility 
almost inevitably enlarges the scope and mag¬ 
nitude of measurement error. However, it is 
possible that some of the resultant sacrifices
in reliability may reduce construct irrelevance
or construct underrepresentation in an
assessment program.

Characteristics and Implications of 
Measurement Error 

Errors of measurement are generally viewed as 
random and unpredictable. They are concep¬ 
tually distinguished from systematic errors, 
which may also affect performance of individ¬ 
uals or groups, but in a consistent rather than 
a random manner. For example, a systematic
group error would occur as a result of differ¬ 
ences in the difficulty of test forms that have 
not been adequately equated. When one test 
form is less difficult than another, examinees
who take the easier form may be expected to
earn a higher average score than those who take
the more difficult form. Such a difference
would not be considered an error of measure¬ 
ment under most methods of quantifying and 
summarizing error, though generalizability 
theory would permit test form differences to 
be recognized as an error source. 

The systematic factors that may differen¬ 
tially affect the performance of individual test 
takers are not as easily detected or overridden 
as those affecting groups. For example, some 
examinees experience levels of test anxiety 
that severely impair cognitive efficiency. The 
presence of such a condition can sometimes 
be recognized in an examinee, but the effect 
cannot be overcome by statistical adjustments. 
The individual systematic errors are not gen¬ 
erally regarded as an element that contributes 
to unreliability. Rather, they constitute a 
source of construct-irrelevant variance and 
thus may detract from validity. 

Important sources of measurement error 
may be broadly categorized as those rooted 
within the examinees and those external to
them. Fluctuations in the level of an exam¬ 
inee's motivation, interest, or attention and 
the inconsistent application of skills are clearly
internal factors that may lead to score
inconsistencies. Differences among testing 
sites in their freedom from distractions, the 
random effects of scorer subjectivity, and vari¬ 
ation in scorer standards are examples of 
external factors. The potency and importance 
of any particular source depend on the specif¬ 
ic conditions under which the measures are 
taken, how performances are scored, and the 
interpretations made from the scores. A partic¬ 
ular factor, such as the subjectivity in scoring, 
may be a significant source of measurement 
error in some assessments and a minor
consideration in others.

Some changes in scores from one occa¬ 
sion to another, it should be noted, are not 
regarded as error, because they result, in part, 
from an intervention, learning, or maturation 
that has occurred between the initial and final
measures. The difference within an individual 
indicates, to some extent, the effects of the 
intervention or the extent of growth. In such 
settings, change per se constitutes the
phenomenon of interest. The difference or the
change score then becomes the measure to 
which reliability pertains. 

Measurement error reduces the useful¬ 
ness of measures. It limits the extent to which 
test results can be generalized beyond the par¬ 
ticulars of a specific application of the meas¬ 
urement process. Therefore, it reduces the 
confidence that can be placed in any single 
measurement. Because random measurement 
errors are inconsistent and unpredictable, 
they cannot be removed from observed 
scores. However, their aggregate magnitude 
can be summarized in several ways, as dis¬ 
cussed below. 

Summarizing Reliability Data

Information about measurement error is 
essential to the proper evaluation and use of 
an instrument. This is true whether the measure
is based on the responses to a specific set
of questions, a portfolio of work samples, the 
performance of a task, or the creation of an 
original product. The ideal approach to the 
study of reliability entails independent repli¬ 
cation of the entire measurement process. 
However, only a rough or partial approxima¬ 
tion of such replication is possible in many 
testing situations, and investigation of measurement
error may require special studies that depart
from routine testing procedures. Nevertheless, 
it should be the goal of test developers to 
investigate test reliability as fully as practical 
considerations permit. No test developer is 
exempt from this responsibility. 

The critical information on reliability 
includes the identification of the major 
sources of error, summary statistics bearing 
on the size of such errors, and the degree of 
generalizability of scores across alternate
forms, scorers, administrations, or other relevant
dimensions. It also includes a description
of the examinee population to whom the 
foregoing data apply, as the data may accu¬ 
rately reflect what is true of one population 
but misrepresent what is true of another. For 
example, a given reliability coefficient or esti¬ 
mated standard error derived from scores of a 
nationally representative sample may differ 
significantly from that obtained for a more 
homogeneous sample drawn from one gen¬ 
der, one ethnic group, or one community. 

Reliability information may be reported 
in terms of variances or standard deviations of 
measurement errors, in terms of one or more 
coefficients, or in terms of IRT-based test 
information functions. The standard error of 
measurement is the standard deviation of a 
hypothetical distribution of measurement 
errors that arises when a given population is 
assessed via a particular test or procedure. 
The overall variance of measurement errors is 
actually a weighted average of the values that 
hold at various true score levels. The variance 
at a particular level is called a conditional 
error variance and its square root a conditional 
standard error. Traditionally, three broad cate¬ 
gories of reliability coefficients have been rec¬ 
ognized: (a) coefficients derived from the 
administration of parallel forms in independent 
testing sessions (alternate-form coefficients); 
(b) coefficients obtained by administration 
of the same instrument on separate occasions
(test-retest or stability coefficients);
and (c) coefficients based on the relation¬ 
ships among scores derived from individual 
items or subsets of the items within a test, 
all data accruing from a single administra¬ 
tion (internal consistency coefficients). 
Where test scoring involves a high level of 
judgment, indexes of scorer consistency are 
commonly obtained. With the development 
of generalizability theory, the foregoing 
three categories may now be seen as special 
cases of a more general classification: gener¬ 
alizability coefficients. 
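
As a minimal sketch of two of the summaries named above, the code below estimates an alternate-form coefficient as the Pearson correlation between two forms and then derives a standard error of measurement from it. The relation SEM = SD * sqrt(1 - reliability) is the usual classical test theory result; the function names and the score lists are hypothetical and serve only as illustration.

```python
import math
import statistics


def pearson(x, y):
    """Pearson correlation between two equally long score lists."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def standard_error_of_measurement(scores, reliability):
    """Classical SEM: observed-score SD times sqrt(1 - reliability)."""
    return statistics.stdev(scores) * math.sqrt(1.0 - reliability)


# Hypothetical scores of the same eight examinees on forms A and B,
# administered in independent sessions.
form_a = [12, 15, 9, 20, 17, 11, 14, 18]
form_b = [13, 14, 10, 19, 18, 10, 15, 17]

r_ab = pearson(form_a, form_b)                       # alternate-form coefficient
sem = standard_error_of_measurement(form_a, r_ab)    # in form A score units
print(f"alternate-form r = {r_ab:.3f}, SEM = {sem:.2f}")
```

The SEM obtained this way is a single overall value; it says nothing about how error variance changes across true score levels, which is what the conditional standard error addresses.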

Like traditional reliability coefficients, a
generalizability coefficient is defined as the ratio
of true or universe score variance to observed 
score variance. Unlike traditional approaches 
to the study of reliability, however, generaliz- 
ability theory permits the researcher to specify 
and estimate the various components of true 
score variance, error variance, and observed 
score variance. Estimation is typically accomplished
by the application of the techniques
of analysis of variance. Of special interest are 
the separate numerical estimates of the com¬ 
ponents of overall error variance. Such esti¬ 
mates permit examination of the contribution 
of each source of error to the overall measure¬ 
ment process. The generalizability approach 
also makes possible the estimation of coefficients
that apply to a wide variety of potential
measurement designs. 
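
The analysis-of-variance estimation described above can be sketched for the simplest case. The example assumes a fully crossed persons-by-raters design with one score per cell, which is only one of many possible designs; the function names and the ratings are hypothetical. Negative variance estimates are truncated at zero, a common convention rather than a requirement.

```python
from statistics import fmean


def variance_components(scores):
    """Estimate variance components from scores[p][r] (person p, rater r)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = fmean(x for row in scores for x in row)
    p_means = [fmean(row) for row in scores]
    r_means = [fmean(row[r] for row in scores) for r in range(n_r)]

    # Mean squares for persons, raters, and the residual (interaction + error).
    ms_p = n_r * sum((m - grand) ** 2 for m in p_means) / (n_p - 1)
    ms_r = n_p * sum((m - grand) ** 2 for m in r_means) / (n_r - 1)
    ss_res = sum(
        (scores[p][r] - p_means[p] - r_means[r] + grand) ** 2
        for p in range(n_p)
        for r in range(n_r)
    )
    ms_res = ss_res / ((n_p - 1) * (n_r - 1))

    return {
        "person": max((ms_p - ms_res) / n_r, 0.0),   # universe score variance
        "rater": max((ms_r - ms_res) / n_p, 0.0),    # rater severity variance
        "residual": ms_res,                          # interaction plus error
    }


def g_coefficient(vc, n_raters):
    """Universe score variance over expected observed score variance."""
    return vc["person"] / (vc["person"] + vc["residual"] / n_raters)


# Hypothetical ratings: five examinees, each scored by the same three raters.
ratings = [[4, 5, 4], [2, 3, 2], [5, 5, 4], [3, 4, 3], [1, 2, 2]]
vc = variance_components(ratings)
print(vc, g_coefficient(vc, n_raters=1), g_coefficient(vc, n_raters=3))
```

Evaluating `g_coefficient` for different numbers of raters illustrates how one set of variance components yields coefficients for several potential measurement designs.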

The test information function, an important
result of IRT, efficiently summarizes how
well the test discriminates among individuals 
at various levels of the ability or trait being 
assessed. Under the IRT conceptualization, a 
mathematical function called the item charac¬ 
teristic curve or item response function is used 
as a model to represent the increasing propor¬ 
tion of correct responses to an item for groups 
at progressively higher levels of the ability or 
trait being measured. Given an adequate 
database, the parameters of the characteristic
curve of each item in a test can be estimated. 
The test information function can then be 
approximated. This function may be viewed 
as a mathematical statement of the precision 
of measurement at each level of the given 
trait. Precision, in the IRT context, is analo¬ 
gous to the reciprocal of the conditional error 
variance of classical test theory. 

Interpretation of Reliability Data 

In general, reliability coefficients are most useful 
in comparing tests or measurement procedures, 
particularly those that yield scores in different 
units or metrics. However, such comparisons
are rarely straightforward. Allowance must be
made for differences in the variability of the
groups on which the coefficients are based,
the techniques used to obtain the coefficients, 
the sources of error reflected in the coeffi¬ 
cients, and the lengths of the instruments 
being compared in terms of testing time. 

Generalizability coefficients and the 
many coefficients included under the tradi¬ 
tional categories may appear to be inter¬ 
changeable, but some convey quite different 
information from others. A coefficient in any 
given category may encompass errors of 
measurement from a highly restricted perspective,
a very broad perspective, or some
point between these extremes. For example, 
a coefficient may reflect error due to scorer 
inconsistencies but not reflect the variation
that characterizes a succession of examinee 
performances or products. A coefficient may 
reflect only the internal consistency of item 
responses within an instrument and fail to 
reflect measurement error associated with 
day-to-day changes in examinee health, effi¬ 
ciency, or motivation. 

It should not be inferred, however, that 
alternate-form or test-retest coefficients based 
on test administrations several days or weeks 
apart are always preferable to internal consistency
coefficients. For many tests, internal
consistency coefficients do not differ signifi¬ 
cantly from alternate-form coefficients. Where 
only one form of a test exists, retesting may 
result in an inflated correlation between the 
first and second scores due to idiosyncratic 
features of the test or to examinee recall of 
initial responses. Also, an individual’s status 
on some attributes, such as mood or emo¬ 
tional state, may change significantly in a 
short period of time. In the assessment of 
such constructs the multiple measures that 
give rise to reliability estimates should be 
obtained within the short period in which the 
attribute remains stable. Therefore, for char¬ 
acteristics of this kind an internal consistency 
coefficient may be preferred. 

The standard error of measurement is
generally more relevant than the reliability 
coefficient once a measurement procedure has 
been adopted and interpretation of scores has 
become the user’s primary concern. It should 
be noted that standard errors share some of 
the ambiguities which characterize reliability 
coefficients, and estimates may vary in their 
quality. Information about the precision of 
measurement at each of several widely spaced 
score levels—that is, conditional standard 
errors—is usually a valuable supplement to the 
single statistic for all score levels combined. 
Like reliability and generalizability coeffi¬ 
cients, standard errors may reflect variation 
from many sources of error or only a few. 
For most purposes, a more comprehensive 
standard error is more informative than a 
less comprehensive value. However, there 
are many exceptions to this generalization. 
Practical constraints often preclude conduct 
of the kinds of studies that would yield esti¬ 
mates of the preferred standard errors. 

Measurements derived from observations 
of behavior or evaluations of products are espe¬ 
cially sensitive to a variety of error factors. These 
include evaluator biases and idiosyncrasies, 
scoring subjectivity, and intra-examinee factors 
that cause variation from one performance or 
product to another. The methods of general¬ 
izability theory are well suited to the investi¬ 
gation of the reliability of the scores on such 
measures. Estimates of the error variance 
associated with each specific source and with 
the interactions between sources indicate the 
extent to which examinee scores may be gen¬ 
eralized to a population of scorers and to a 
universe of products or performances. 

The interpretations of test scores may be 
broadly categorized as relative or absolute. 
Relative interpretations convey the standing 
of an individual or group within a reference 
population. Absolute interpretations relate the 
status of an individual or group to defined 
standards. These standards may originate in 
empirical data for one or more populations or
be based entirely on authoritative judgment.
Different values of the standard error apply 
to the two types of interpretations. 
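
As one illustration of why the two interpretations call for different standard errors, consider a hypothetical design in which every person is scored by the same $n_r$ raters. The symbols are assumptions introduced here, following common generalizability theory notation: $\sigma^2_r$ is the variance due to differences in rater severity and $\sigma^2_{pr,e}$ is the residual variance (person-by-rater interaction plus unexplained error).

```latex
\sigma^2_{\text{rel}} = \frac{\sigma^2_{pr,e}}{n_r}, \qquad
\sigma^2_{\text{abs}} = \frac{\sigma^2_{r} + \sigma^2_{pr,e}}{n_r}
```

Rater severity shifts every examinee equally, so it leaves rank order intact and drops out of the relative error variance. It does move scores relative to a fixed standard, so it enters the absolute error variance. The corresponding standard errors are the square roots of these quantities.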

The test information function can be 
perceived as an alternative to traditional indices
of measurement precision, but there are 
important distinctions that should be noted. 
Standard errors under classical test theory can 
be derived by several different approaches. 
These yield similar, but not identical, results. 
More significantly, standard errors, like relia¬ 
bility coefficients, may reflect a broad con¬ 
figuration of error factors or a restricted 
configuration, depending on the design of the 
reliability study. Test information functions, 
on the other hand, are limited to the restrict¬ 
ed definition of measurement error that is 
associated with internal consistency reliabili¬ 
ties. In addition, under IRT several different 
mathematical models have been proposed and 
accepted as the basic form of the item charac¬ 
teristic curve. Adoption of one model rather 
than another can have a material effect on the 
derived test information function. 

A final consideration has significant impli¬ 
cations for both IRT and classical approaches 
to quantification of test score precision. It is 
this: Indices of precision depend on the scale 
in which they are reported. An index stated 
in terms of raw scores or the trait level estimates
of IRT may convey a radically different
perception of reliability than the same index 
restated in terms of derived scores. This same 
contrast may hold for conditional standard 
errors. In terms of the basic score scale, preci¬ 
sion may appear to be high at one score level, 
low at another. But when the conditional
standard errors are restated in units of derived
scores, such as grade equivalents or standard
scores, quite different trends in comparative
precision may emerge. Therefore, measure¬ 
ment precision under both theories very 
strongly depends on the scale in which test 
scores are reported and interpreted. 

Precision and consistency in measure¬ 
ment are always desirable. However, the need 
for precision increases as the consequences of
decisions and interpretations grow in impor¬ 
tance. If a decision can and will be corrobo¬ 
rated by information from other sources or if 
an erroneous initial decision can be quickly 
corrected, scores with modest reliability may 
suffice. But if a test score leads to a decision 
that is not easily reversed, such as rejection or 
admission of a candidate to a professional 
school or the decision by a jury that a serious 
injury was sustained, the need for a high degree 
of precision is much greater. 

Where the purpose of measurement is 
classification, some measurement errors are
more serious than others. An individual who 
is far above or far below the value established 
for pass/fail or for eligibility for a special pro¬ 
gram can be mismeasured without serious 
consequences. Mismeasurement of examinees 
whose true scores are close to the cut score is 
a more serious concern. The techniques used 
to quantify reliability should recognize these 
circumstances. This can be done by reporting 
the conditional standard error in the vicinity 
of the critical value. 
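
The point about proximity to the cut score can be made concrete with a small sketch. It assumes that measurement errors are normally distributed with a known conditional standard error, which is a simplifying assumption and not something the text asserts; the function names and numbers are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def misclassification_probability(true_score, cut_score, csem):
    """Chance that the observed score lands on the wrong side of the cut."""
    z = (cut_score - true_score) / csem
    p_below = normal_cdf(z)  # P(observed score < cut score)
    return p_below if true_score >= cut_score else 1.0 - p_below


# Hypothetical pass/fail cut of 70 with a conditional standard error of 3.
for true_score in (60, 68, 70, 72, 80):
    p = misclassification_probability(true_score, cut_score=70, csem=3.0)
    print(f"true score {true_score}: P(misclassified) = {p:.4f}")
```

Under these assumptions an examinee ten points from the cut is almost never misclassified, while one within a standard error of the cut is misclassified a substantial fraction of the time.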

Some authorities have proposed that a 
semantic distinction be made between “reliability
of scores” and “degree of agreement in
classification.” The former term would be
reserved for analysis of score variation under 
repeated measurement. The term classification 
consistency or inter-rater agreement, rather than 
reliability, would be used in discussions of 
consistency of classification. Adoption of such 
usage would make it clear that the impor¬ 
tance of an error of any given size depends on 
the proximity of the examinee's score to the
cut score. However, it should be recognized
that the degree of consistency or agreement in
examinee classification is specific to the cut 
score employed and its location within the 
score distribution. 
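
A minimal sketch of a classification consistency index follows, showing how the same pair of score sets can yield different agreement depending on where the cut score falls. The proportion of consistent decisions and Cohen's kappa are used here as familiar examples of agreement indices; the text does not prescribe them, and the function name and scores are hypothetical.

```python
def classification_consistency(scores_1, scores_2, cut):
    """Return (proportion of consistent pass/fail decisions, Cohen's kappa)."""
    n = len(scores_1)
    pass_1 = [s >= cut for s in scores_1]
    pass_2 = [s >= cut for s in scores_2]
    agree = sum(a == b for a, b in zip(pass_1, pass_2)) / n
    p1, p2 = sum(pass_1) / n, sum(pass_2) / n
    chance = p1 * p2 + (1 - p1) * (1 - p2)   # agreement expected by chance
    kappa = (agree - chance) / (1 - chance) if chance < 1 else float("nan")
    return agree, kappa


# Hypothetical scores of eight examinees on two administrations.
first = [12, 15, 9, 20, 17, 11, 14, 18]
second = [13, 14, 10, 19, 18, 10, 15, 17]

print(classification_consistency(first, second, cut=14))  # agreement 1.0
print(classification_consistency(first, second, cut=15))  # agreement 0.75, kappa 0.5
```

Moving the cut by one point changes the agreement from perfect to 75 percent, even though the scores themselves are unchanged.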

Average scores of groups, when interpret¬ 
ed as measures of program effectiveness, 
involve error factors that are not identical to 
those that operate at the individual level. For
large groups, the positive and negative measurement
errors of individuals may average out
almost completely in group means. However, 
the sampling errors associated with the ran¬ 
dom sampling of persons who are tested for 
purposes of program evaluation are still pres¬ 
ent. This component of the variation in the 
mean achievement of school classes from year 
to year or in the average expressed satisfaction 
of successive samples of the clients of a pro¬ 
gram may constitute a potent source of error 
in program evaluations. It can be a significant 
source of error in inferences about programs 
even if there is a high degree of precision in 
individual test scores. Therefore, when an 
instrument is used to make group judgments, 
reliability data must bear directly on the
interpretations specific to groups. Standard 
errors appropriate to individual scores are not 
appropriate measures of the precision of group 
averages. A more appropriate statistic is the 
standard error of the observed score means. 
Generalizability theory can provide more 
refined indices when the sources of measure¬ 
ment error are numerous and complex. 
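
For the simplest case, the standard error of an observed score mean can be sketched as below. The sketch assumes simple random sampling of persons from the population of interest; clustered or multi-stage samples would need a different estimator. The function name and class scores are hypothetical.

```python
import math
import statistics


def standard_error_of_mean(scores):
    """Observed-score SD divided by the square root of the group size."""
    return statistics.stdev(scores) / math.sqrt(len(scores))


# Hypothetical scores for one class of eight students.
class_scores = [62, 71, 58, 80, 75, 66, 69, 73]
se_mean = standard_error_of_mean(class_scores)
print(f"mean = {statistics.fmean(class_scores):.1f}, SE of mean = {se_mean:.2f}")
```

Because the observed score variance already contains both true score differences among persons and measurement error, this statistic reflects the sampling of persons as well as individual measurement error, which is why it differs from the standard error of measurement for a single score.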

Typically, developers and distributors of 
tests have primary responsibility for obtain¬ 
ing and reporting evidence of reliability or 
test information functions. The user must 
have such data to make an informed choice 
among alternative measurement approaches 
and will generally be unable to conduct relia¬ 
bility studies prior to operational use of an 
instrument. In some instances, however, local 
users of a test or procedure must accept at 
least partial responsibility for documenting 
the precision of measurement. This obliga¬ 
tion holds when one of the primary purposes 
of measurement is to rank or classify exam¬ 
inees within the local population. It also 
holds when users must rely on local scorers 
who are trained to use the scoring rubrics 
provided by the test developer. In such set¬ 
tings, local factors may materially affect the 
magnitude of error variance and observed 
score variance. Therefore, the reliability of 
scores may differ appreciably from that reported
by the developer.

The reporting of reliability coefficients 
alone, with little detail regarding the methods 
used to estimate the coefficient, the nature of
the group from which the data were derived, 
and the conditions under which the data were 
obtained constitutes inadequate documentation. 
General statements to the effect that a test is 
“reliable” or that it is “sufficiently reliable to
permit interpretations of individual scores” are
rarely, if ever, acceptable. It is the user who must
take responsibility for determining whether or
not scores are sufficiently trustworthy to justify
anticipated uses and interpretations. Of course, 
test constructors and publishers are obligated 
to provide sufficient data to make informed 
judgments possible. 

As the foregoing comments emphasize, 
there is no single, preferred approach to 
quantification of reliability. No single index 
adequately conveys all of the relevant facts. 
No one method of investigation is optimal in 
all situations, nor is the test developer limited 
to a single approach for any instrument. The 
choice of estimation techniques and the mini- 
mum acceptable level for any index remain a 
matter of professional judgment. 

Although reliability is discussed here as an 
independent characteristic of test scores, it should 
be recognized that the level of reliability of scores 
has implications for the validity of score inter¬ 
pretations. Reliability data ultimately bear on 
the repeatability of the behavior elicited by the 
test and the consistency of the resultant scores. 
The data also bear on the consistency of classi¬ 
fications of individuals derived from the scores. 
To the extent that scores reflect random errors 
of measurement, their potential for accurate 
prediction of criteria, for beneficial examinee 
diagnosis, and for wise decision making is lim¬ 
ited. Relatively unreliable scores, in conjunction 
with other convergent information, may some¬ 
times be of value to a test user, but the level of 
a score's reliability places limits on its unique
contribution to validity for all purposes.


Standard 2.1 

For each total score, subscore, or combina¬ 
tion of scores that is to be interpreted, esti¬ 
mates of relevant reliabilities and standard 
errors of measurement or test information
functions should be reported. 

Comment: It is not sufficient to report esti¬ 
mates of reliabilities and standard errors of 
measurement only for total scores when subscores
are also interpreted. The form-to-form
and day-to-day consistency of total scores on 
a test may be acceptably high, yet subscores 
may have unacceptably low reliability. For all 
scores to be interpreted, users should be sup¬ 
plied with reliability data in enough detail to 
judge whether scores are precise enough for 
the users’ intended interpretations. Composites 
formed from selected subtests within a test 
battery are frequently proposed for predictive 
and diagnostic purposes. Users need informa¬ 
tion about the reliability of such composites. 

Standard 2.2 

The standard error of measurement, both 
overall and conditional (if relevant), should 
be reported both in raw score or original 
scale units and in units of each derived score 
recommended for use in test interpretation. 

Comment: The most common derived scores
include standard scores, grade or age equiva¬ 
lents, and percentile ranks. Because raw scores 
on norm-referenced tests are only rarely inter¬ 
preted directly, standard errors in derived 
score units are more helpful to the typical test 
user. A confidence interval for an examinee's
true score, universe score, or percentile rank 
serves much the same purpose as a standard 
error and can be used as an alternative approach 
to convey reliability information. The impli¬ 
cations of the standard error of measurement 
are especially important in situations where 
decisions cannot be postponed and corroborative
sources of information are limited.
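
A confidence interval of the kind mentioned in the comment can be sketched as a symmetric band around the observed score. This assumes normally distributed errors and a standard error expressed in the same units as the score; other constructions exist, and the function name and values are hypothetical.

```python
from statistics import NormalDist


def true_score_interval(observed, sem, level=0.95):
    """Symmetric band: observed score plus or minus z times the SEM."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return observed - z * sem, observed + z * sem


# Hypothetical standard score of 100 with an SEM of 3 standard-score points.
low, high = true_score_interval(observed=100.0, sem=3.0)
print(f"95% band: {low:.2f} to {high:.2f}")
```

The same function applies to raw scores or to any derived score, provided the standard error supplied is the one estimated for that scale.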

Standard 2.3

When test interpretation emphasizes differ¬ 
ences between two observed scores of an 
individual or two averages of a group, relia¬ 
bility data, including standard errors, should 
be provided for such differences. 

Comment: Observed score differences are used 
for a variety of purposes. Achievement gains 
are frequently the subject of inferences for 
groups as well as individuals. Differences 
between verbal and performance scores of 
intelligence and scholastic ability tests are 
often employed in the diagnosis of cognitive 
impairment and learning problems. Psycho¬ 
diagnostic inferences are frequently drawn 
from the differences between subtest scores. 
Aptitude and achievement batteries, interest 
inventories, and personality assessments are 
commonly used to identify and quantify the 
relative strengths and weaknesses or the pat¬ 
tern of trait levels of an examinee. When the 
interpretation of test scores centers on the 
peaks and valleys in the examinee's test score
profile, the reliability of score differences for 
all pairs of scores is critical. 
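
The reason difference scores deserve their own reliability data can be illustrated with the classical formulas for the reliability and standard error of a difference. The sketch assumes classical test theory with uncorrelated measurement errors on the two scores; the function names and input values are hypothetical.

```python
import math


def difference_reliability(sd_x, sd_y, rel_x, rel_y, r_xy):
    """Reliability of D = X - Y: true difference variance over observed."""
    cov = r_xy * sd_x * sd_y
    true_var = sd_x ** 2 * rel_x + sd_y ** 2 * rel_y - 2.0 * cov
    obs_var = sd_x ** 2 + sd_y ** 2 - 2.0 * cov
    return true_var / obs_var


def difference_standard_error(sd_x, sd_y, rel_x, rel_y):
    """Standard error of D: root of the summed squared SEMs."""
    sem_x_sq = sd_x ** 2 * (1.0 - rel_x)
    sem_y_sq = sd_y ** 2 * (1.0 - rel_y)
    return math.sqrt(sem_x_sq + sem_y_sq)


# Hypothetical subtests: SD 15 each, reliability .90 each, correlated .70.
rel_d = difference_reliability(15.0, 15.0, 0.90, 0.90, 0.70)
se_d = difference_standard_error(15.0, 15.0, 0.90, 0.90)
print(f"reliability of difference = {rel_d:.3f}, SE = {se_d:.2f}")
# reliability is about 0.667, well below the .90 of either score
```

Two scores that are each highly reliable can therefore produce a difference that is much less reliable, particularly when the scores are strongly correlated.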

Standard 2.4 

Each method of quantifying the precision 
or consistency of scores should be described 
clearly and expressed in terms of statistics 
appropriate to the method. The sampling 
procedures used to select examinees for relia¬ 
bility analyses and descriptive statistics on 
these samples should be reported.

Comment: Information on the method of 
subject selection, sample sizes, means, stan¬ 
dard deviations, and demographic characteris¬ 
tics of the groups helps users judge the extent 
to which reported data apply to their own 
examinee populations. If the test-retest or
alternate-form approach is used, the interval
between testings should be indicated. Because
there are many ways of estimating reliability,
each influenced by different sources of measurement
error, it is unacceptable to say simply,
“The reliability of test X is .90.” A better
statement would be, “The reliability coeffi¬ 
cient of .90 reported for scores on test X was 
obtained by correlating scores from forms A 
and B administered on successive days. The 
data were based on a sample of 400 10th-grade
students from five middle-class suburban
schools in New York State. The demographic
breakdown of this group was as follows: ....”

Standard 2.5 

A reliability coefficient or standard error of 
measurement based on one approach should 
not be interpreted as interchangeable with 
another derived by a different technique 
unless their implicit definitions of measurement
error are equivalent.

Comment: Internal consistency, alternate- 
form, test-retest, and generalizability coeffi¬ 
cients should not be considered equivalent, as 
each may incorporate a unique definition of 
measurement error. Error variances derived 
via item response theory may not be equiva¬ 
lent to error variances estimated via other 
approaches. Test developers should indicate 
the sources of error that are reflected in or
ignored by the reported reliability indices.

Standard 2.6 

If reliability coefficients are adjusted for restric¬ 
tion of range or variability, the adjustment pro¬ 
cedure and both the adjusted and unadjusted 
coefficients should be reported. The standard 
deviations of the group actually tested and of 
the target population, as well as the rationale 
for the adjustment, should be presented. 

Comment: Application of a correction for 
restriction in variability presumes that the 
available sample is not representative of the 
test-taker population to which users might be 
expected to generalize. The rationale for the 
correction should consider the appropriateness
of such a generalization. Adjustment formulas
that presume constancy in the standard
error across score levels should not be used 
unless constancy can be defended. 
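
As an illustration of the kind of adjustment at issue, the sketch below uses a widely taught formula that holds the error variance fixed and rescales the coefficient to the target population's variability. It is exactly the sort of formula that presumes a constant standard error, so the caveat in the comment applies to it. The function name and values are hypothetical.

```python
def adjust_for_range(rel_sample, sd_sample, sd_target):
    """Rescale a reliability coefficient, holding error variance constant."""
    error_variance = sd_sample ** 2 * (1.0 - rel_sample)
    return 1.0 - error_variance / sd_target ** 2


# Hypothetical study: reliability .80 in a sample with SD 8,
# target population SD 10.
unadjusted = 0.80
adjusted = adjust_for_range(unadjusted, sd_sample=8.0, sd_target=10.0)

# Report both coefficients and both standard deviations together.
report = {
    "unadjusted": unadjusted,
    "adjusted": round(adjusted, 3),   # 0.872
    "sd_sample": 8.0,
    "sd_target": 10.0,
}
print(report)
```

Keeping the unadjusted coefficient and both standard deviations in the same record lets a reader check the adjustment and judge whether the assumed constancy is plausible.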

Standard 2.7 

When subsets of items within a test are dic¬ 
tated by the test specifications and can be 
presumed to measure partially independent 
traits or abilities, reliability estimation procedures
should recognize the multifactor
character of the instrument. 

Comment: The total score on a test that is 
clearly multifactor in nature should be treated 
as a composite score. If an internal consistency 
estimate of total score reliability is obtained 
by the split-halves procedure, the halves 
should be parallel in content and statistical 
characteristics. Stratified coefficient alpha 
should be used rather than the more familiar 
nonstratified coefficient.
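
The contrast between ordinary and stratified coefficient alpha can be sketched as follows. Stratified alpha is computed here as one minus the sum, over strata, of each stratum's score variance times one minus its alpha, divided by the total score variance. The function names, the two content strata, and the item responses are hypothetical.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Coefficient alpha from item_scores[person][item]."""
    k = len(item_scores[0])
    item_vars = [variance([row[i] for row in item_scores]) for i in range(k)]
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1.0 - sum(item_vars) / total_var)


def stratified_alpha(strata):
    """Stratified alpha; each stratum is an item-score matrix, same persons."""
    n_persons = len(strata[0])
    totals = [sum(sum(s[p]) for s in strata) for p in range(n_persons)]
    total_var = variance(totals)
    penalty = 0.0
    for s in strata:
        stratum_var = variance([sum(row) for row in s])
        penalty += stratum_var * (1.0 - cronbach_alpha(s))
    return 1.0 - penalty / total_var


# Hypothetical right/wrong responses of six examinees to two item strata.
verbal = [[1, 1, 1], [1, 1, 0], [0, 0, 0], [1, 0, 1], [0, 1, 0], [1, 1, 1]]
quant = [[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0], [1, 1, 1], [1, 1, 0]]

pooled = [v + q for v, q in zip(verbal, quant)]
print("nonstratified alpha:", round(cronbach_alpha(pooled), 3))
print("stratified alpha:   ", round(stratified_alpha([verbal, quant]), 3))
```

Pooling all items into one alpha treats the test as homogeneous, whereas the stratified version evaluates consistency within each content stratum and then combines the results as a composite.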

Standard 2.8 

Test users should be informed about the 
degree to which rate of work may affect 
examinee performance. 

Comment: It is not possible to state, in general, 
whether reliability coefficients will increase or 
decrease when rate of work becomes an impor¬ 
tant source of systematic variance. Rate of work, 
as an examinee trait, may be more stable or 
less stable from occasion to occasion than the
other factors the test is designed to measure. 
Because speededness has differential effects on 
various estimates, information on speededness 
is helpful in interpreting reported coefficients. 

The importance of the speed factor can 
sometimes be inferred from analyses of item 
responses and from observations by examiners 
during test administrations conducted for 
reliability analyses. The distribution of “last
item attempted” and increases in the frequency
of omitted responses toward the end of a
test are also highly informative, though not 
conclusive, evidence regarding speededness. A 
decline in the proportion of correct responses, 
beyond that attributable to increasing item 
difficulty, may indicate that some examinees
were responding randomly. With computer- 
administered tests, abnormally fast item response 
times, particularly toward the end of the test, 
may also suggest that examinees were responding
randomly. In the case of constructed-response
exercises, including essay questions,
the completeness of the responses may sug¬ 
gest that time constraints had little effect on 
early items but a significant effect on later 
items. Introduction of a speed factor into 
what might otherwise be a power test may 
have a marked effect on alternate-form and 
test-retest reliabilities. A shift from a paper-
and-pencil format to a computer-administered
format may affect test speededness.

Standard 2.9 

When a test is designed to reflect rate of 
work, reliability should be estimated by the 
alternate-form or test-retest approach, using 
separately timed administrations. 

Comment: Split-half coefficients based on 
separate scores from the odd-numbered and 
even-numbered items are known to yield 
inflated estimates of reliability for highly 
speeded tests. Coefficient alpha and other 
internal consistency coefficients may also be 
biased, though the size of the bias is not as 
clear as that for the split-halves coefficient. 

Standard 2.10 

When subjective judgment enters into test 
scoring, evidence should be provided on both 
inter-rater consistency in scoring and within- 
examinee consistency over repeated measure¬ 
ments. A clear distinction should be made 
among reliability data based on (a) independent
panels of raters scoring the same performances
or products, (b) a single panel scoring
successive performances or new products, and 
(c) independent panels scoring successive per¬ 
formances or new products. 

Comment: Task-to-task variations in the quality
of an examinee’s performance and rater-to-rater
inconsistencies in scoring represent independent
sources of measurement error. Reports of
reliability studies should make clear which of
these sources are reflected in the data. Where
feasible, the error variances arising from each 
source should be estimated. Generalizability 
studies and variance component analyses are 
especially helpful in this regard. These analy¬ 
ses can provide separate error variance esti¬ 
mates for tasks within examinees, for judges, 
and for occasions within the time period of 
trait stability. Information should be provided
on the qualifications of the judges used in 
reliability studies. 
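
A minimal sketch of a variance component analysis follows. It assumes a
deliberately simplified design in which every examinee is scored once by
every rater (fully crossed, one score per cell) and uses the usual
expected-mean-square estimators for that design. Tasks and occasions,
which the comment also mentions, are omitted; the function name and the
data are hypothetical.

```python
def variance_components(scores):
    """scores[p][r]: score given to examinee p by rater r (fully crossed)."""
    n_p, n_r = len(scores), len(scores[0])
    grand = sum(map(sum, scores)) / (n_p * n_r)
    p_means = [sum(row) / n_r for row in scores]
    r_means = [sum(col) / n_p for col in zip(*scores)]

    # Sums of squares for examinees, raters, and the total.
    ss_p = n_r * sum((m - grand) ** 2 for m in p_means)
    ss_r = n_p * sum((m - grand) ** 2 for m in r_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)

    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_res = (ss_total - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))

    # Negative estimates are truncated at zero, a common convention.
    return {
        "examinee": max((ms_p - ms_res) / n_r, 0.0),
        "rater": max((ms_r - ms_res) / n_p, 0.0),
        "residual": ms_res,
    }


# Invented data: three examinees, two raters; rater 2 is uniformly one point higher.
components = variance_components([[1, 2], [3, 4], [5, 6]])
```

Here the rater component reflects a constant difference in severity, and
the residual component (examinee-by-rater interaction confounded with
other error) is zero because the invented data are perfectly additive.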

Inter-rater or inter-observer agreement 
may be particularly important for ratings and 
observational data that involve subtle discriminations.
It should be noted, however, that
when raters evaluate positively correlated
characteristics, a favorable or unfavorable
assessment of one trait may color their opinions
of other traits. Moreover, high inter-rater
consistency does not imply high examinee
consistency from task to task. Therefore,
internal consistency within raters and inter-rater
agreement do not guarantee high reliability
of examinee scores.

Standard 2.11 

If there are generally accepted theoretical or 
empirical reasons for expecting that reliabili¬ 
ty coefficients, standard errors of measure¬ 
ment, or test information functions will 
differ substantially for various subpopulations,
publishers should provide reliability
data as soon as feasible for each major popu¬ 
lation for which the test is recommended. 


Comment: If test score interpretation involves 
inferences within subpopulations as well as 
within the general population, reliability data 
should be provided for both the subpopulations 
and the general population. Test users who 
work exclusively with a specific cultural group 
or with individuals who have a particular dis¬ 
ability would benefit from an estimate of the 
standard error for such a subpopulation. Some 
groups of test takers—pre-school children, for 
example—tend to respond to test stimuli in a
less consistent fashion than do older children. 

Standard 2.12 

If a test is proposed for use in several grades 
or over a range of chronological age groups 
and if separate norms are provided for each 
grade or each age group, reliability data should 
be provided for each age or grade population, 
not solely for all grades or ages combined. 

Comment: A reliability coefficient based on a 
sample of examinees spanning several grades 
or a broad range of ages in which average 
scores are steadily increasing will generally 
give a spuriously inflated impression of relia¬ 
bility. When a test is intended to discriminate 
within age or grade populations, reliability 
coefficients and standard errors should be 
reported separately for each population. 
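
The inflation follows from treating reliability as 1 minus the ratio of
error variance to observed-score variance: pooling grades whose means
differ adds between-grade variance to the denominator while the error
variance stays the same. The short Python sketch below uses invented
variances purely for illustration.

```python
def reliability(observed_variance, error_variance):
    """Reliability as 1 - (error variance / observed-score variance)."""
    return 1 - error_variance / observed_variance


# Hypothetical values: within any single grade, the observed-score
# variance is 100 and the error variance is 25.
within_grade = reliability(100, 25)

# Pooling grades adds between-grade variance (assumed here to be 100)
# to the observed-score variance; the error variance is unchanged.
pooled = reliability(100 + 100, 25)
```

Under these assumed values the pooled coefficient exceeds the
within-grade coefficient even though measurement precision is identical.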

Standard 2.13 

If local scorers are employed to apply gener¬ 
al scoring rules and principles specified by 
the test developer, local reliability data should 
be gathered and reported by local authorities 
when adequate size samples are available. 

Comment: For example, many statewide testing
programs depend on local scoring of
essays, constructed-response exercises, and 
performance tests. Reliability analyses bear on 
the possibility that additional training of scorers
is needed and, hence, should be an integral
part of program monitoring.

Standard 2.14

Conditional standard errors of measurement 
should be reported at several score levels if 
constancy cannot be assumed. Where cut scores 
are specified for selection or classification, the
standard errors of measurement should be 
reported in the vicinity of each cut score. 

Comment: Estimation of conditional standard 
errors is usually feasible even with the sample 
sizes that are typically used for reliability 
analyses. If it is assumed that the standard
error is constant over a broad range of score 
levels, the rationale for this assumption should 
be presented. 
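
One simple way to obtain conditional standard errors for number-correct
scores is a binomial error model, under which the standard error at raw
score x on an n-item test is sqrt(x(n - x)/(n - 1)). The Standards do
not prescribe this model; it is used here only as an assumption for
illustration, and the function name and the cut score are hypothetical.

```python
import math


def binomial_csem(raw_score, n_items):
    """Conditional SEM at a raw score under a binomial error model."""
    return math.sqrt(raw_score * (n_items - raw_score) / (n_items - 1))


# Hypothetical 40-item test with an assumed cut score of 28:
# report the standard error at and around the cut score.
n_items = 40
cut_score = 28
csem_near_cut = {x: binomial_csem(x, n_items) for x in range(cut_score - 2, cut_score + 3)}
```

Under this model the standard error is largest near the middle of the
score range and shrinks toward the extremes, which is one reason a
single overall value may misstate precision near a cut score.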

Standard 2.15 

When a test or combination of measures is 
used to make categorical decisions, estimates 
should be provided of the percentage of 
examinees who would be classified in the 
same way on two applications of the proce¬ 
dure, using the same form or alternate forms 
of the instrument. 

Comment: When a test or composite is used to 
make categorical decisions, such as pass/fail, 
the standard error of measurement at or near 
the cut score has important implications for the
trustworthiness of these decisions. However, 
the standard error cannot be translated into 
the expected percentage of consistent deci¬ 
sions unless assumptions are made about the 
form of the distributions of measurement 
errors and true scores. It is preferable that this
percentage be estimated directly through the 
use of a repeated-measurements approach if 
consistent with the requirements of test secu¬ 
rity and if adequate samples are available. 
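
A direct estimate from repeated measurements can be sketched as follows,
assuming each examinee has scores from two administrations and a single
pass/fail cut score. The chance-corrected index shown alongside the raw
percentage (Cohen's kappa) is an addition not mentioned in the text; the
function name and the data are hypothetical.

```python
def decision_consistency(scores_first, scores_second, cut_score):
    """Return (proportion classified the same way, Cohen's kappa)."""
    first = [s >= cut_score for s in scores_first]
    second = [s >= cut_score for s in scores_second]
    n = len(first)

    agreement = sum(a == b for a, b in zip(first, second)) / n

    # Agreement expected by chance from the two marginal pass rates.
    p1 = sum(first) / n
    p2 = sum(second) / n
    chance = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (agreement - chance) / (1 - chance) if chance < 1 else float("nan")
    return agreement, kappa


# Invented scores for six examinees on two administrations, cut score 15.
agreement, kappa = decision_consistency(
    [10, 12, 15, 18, 20, 9], [11, 16, 14, 17, 19, 8], cut_score=15
)
```

With these invented scores, four of the six examinees receive the same
classification on both occasions.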

Standard 2.16 

In some testing situations, the items vary from
examinee to examinee—through random selection
from an extensive item pool or application
of algorithms based on the examinee’s level of
performance on previous items or preferences
with respect to item difficulty. In this type of
testing, the preferred approach to reliability
estimation is one based on successive adminis¬ 
trations of the test under conditions similar to 
those prevailing in operational test use. 

Comment: Varying the set of items presented 
to each examinee is an acceptable procedure 
in some settings. If this approach is used, reli¬ 
ability data should be appropriate to this pro¬ 
cedure. Estimates of standard errors of ability 
scores can be computed through the use of 
IRT and reported routinely as part of the 
adaptive testing procedure. However, those 
estimates are not an adequate substitute for 
estimates based on successive administrations 
of the adaptive test, nor do they bear on the 
issue of stability over short intervals. IRT estimates
are contingent on the adequacy of both
the item parameter estimates and the item response
models adopted in the theory. Estimates
of reliabilities and standard errors of measure¬ 
ment based on the administration and analysis 
of alternate forms of an adaptive test reflect 
errors associated with the entire measurement 
process. The alternate-form estimates provide 
an independent check on the magnitude of 
the errors of measurement specific to the 
adaptive feature of the testing procedure. 

Standard 2.17 

When a test is available in both long and short 
versions, reliability data should be reported for 
scores on each version, preferably based on an 
independent administration of each. 

Comment: Some tests and test batteries are 
published in both a “full-length” version and
a “survey” or “short” version. In many applications
the Spearman-Brown formula will satisfactorily
approximate the reliability of one of
these from data based on the other. However, 
context effects are commonplace in tests of
maximum performance. Also, the short version
of a standardized test often comprises a
nonrandom sample of items from the full-
length version. Therefore, the shorter version
may be more reliable or less reliable than the
Spearman-Brown projections from the full-
length version. The reliability of scores on
each version is best evaluated through an 
independent administration of each, using 
the designated time limits. 
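
For reference, the Spearman-Brown projection mentioned above predicts
the reliability of a test whose length is changed by a factor k as
k*r / (1 + (k - 1)*r), where r is the reliability of the original test.
The Python sketch below is illustrative; the function name and the
example values are hypothetical.

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when test length is multiplied by length_factor."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


# Hypothetical case: a full-length test with reliability 0.90 and a
# short version containing half as many items.
projected_short = spearman_brown(0.90, 0.5)
```

The projection assumes that the retained items are comparable to those
removed, which is exactly the assumption the comment cautions may fail
for a nonrandom selection of items.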

Standard 2.18 

When significant variations are permitted in 
test administration procedures, separate reli¬ 
ability analyses should be provided for scores 
produced under each major variation if ade¬ 
quate sample sizes are available. 

Comment: To accommodate examinees with 
disabilities, test publishers might authorize 
modifications in the procedures and time 
limits that are specified for the administration 
of the paper-and-pencil edition of a test. In
some cases, modified editions of the test itself 
may be provided. For example, tape-recorded 
versions for use in a group setting or with 
individual equipment may be used to test 
examinees who exhibit reading disabilities or 
attention deficits. If such modifications can 
be employed with test takers who are not disabled,
insights can be gained regarding the
possible effects on test scores of these non¬ 
standard administrations. 

Standard 2.19 

When average test scores for groups are used 
in program evaluations, the groups tested 
should generally be regarded as a sample 
from a larger population, even if all exam¬ 
inees available at the time of measurement are 
tested. In such cases the standard error of the 
group mean should be reported, as it reflects 
variability due to sampling of examinees as 
well as variability due to measurement error. 


Comment: The graduating seniors of a liberal 
arts college, the current clients of a social 
service agency, and analogous groups exposed 
to a program of interest typically constitute a 
sample in a longitudinal sense. Presumably, 
comparable groups from the same population 
will recur in future years, given static condi¬ 
tions. The factors leading to uncertainty in 
conclusions about program effectiveness arise 
from the sampling of persons as well as meas¬ 
urement error. Therefore, the standard error 
of the mean observed score, reflecting varia¬ 
tion in both true scores and measurement 
errors, represents a more realistic standard 
error in this setting. Even this value may 
underestimate the variability of group means 
over time. In many settings, the static condi¬ 
tions assumed under random sampling of 
persons do not prevail. 
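
The contrast between the two standard errors can be sketched in Python.
The standard error of the mean observed score is the standard deviation
of observed scores divided by the square root of the group size; because
observed-score variance contains both true-score and error variance, it
exceeds a value based on measurement error alone. The function names and
the data below are hypothetical.

```python
import math
from statistics import variance


def standard_error_of_mean(observed_scores):
    """SE of the group mean, reflecting sampling of examinees and measurement error."""
    n = len(observed_scores)
    return math.sqrt(variance(observed_scores) / n)


def measurement_only_se(sem, n):
    """SE of the group mean if measurement error were the only source of variation."""
    return sem / math.sqrt(n)


# Invented scores for a group of four examinees, with an assumed
# individual standard error of measurement of 2.0.
scores = [10, 12, 14, 16]
se_mean = standard_error_of_mean(scores)
se_error_only = measurement_only_se(2.0, len(scores))
```

With these invented values the standard error based on observed scores
is the larger of the two.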

Standard 2.20 

When the purpose of testing is to measure the 
performance of groups rather than individuals, 
a procedure frequently used is to assign a small 
subset of items to each of many subsamples of 
examinees. Data are aggregated across sub¬ 
samples and item subsets to obtain a measure 
of group performance. When such procedures 
are used for program evaluation or population 
descriptions, reliability analyses must take the 
sampling scheme into account. 

Comment: This type of measurement program 
is termed matrix sampling. It is designed to 
reduce the time demanded of individual 
examinees and to increase the total number of 
items on which data are obtained. This test¬ 
ing approach provides the same type of infor¬ 
mation about group performances that would 
accrue if all examinees could respond to all 
exercises in the item pool. Reliability statistics 
must be appropriate to the sampling plan 
used with respect to examinees and items. 
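
A minimal sketch of matrix sampling follows, assuming items are split
into blocks that are rotated across examinees and that group performance
is summarized as the mean proportion correct across items. The function
names, the rotation rule, and the data are hypothetical; operational
programs use more elaborate designs.

```python
def assign_item_blocks(examinee_ids, item_ids, n_blocks):
    """Split items into n_blocks subsets and rotate the subsets across examinees."""
    blocks = [item_ids[i::n_blocks] for i in range(n_blocks)]
    return {e: blocks[i % n_blocks] for i, e in enumerate(examinee_ids)}


def group_proportion_correct(responses):
    """responses: {examinee: {item: 0 or 1}} -> mean of the item proportions correct."""
    by_item = {}
    for answers in responses.values():
        for item, score in answers.items():
            by_item.setdefault(item, []).append(score)
    p_values = [sum(v) / len(v) for v in by_item.values()]
    return sum(p_values) / len(p_values)


# Invented example: four examinees, four items, two blocks.
assignment = assign_item_blocks(["e1", "e2", "e3", "e4"], ["i1", "i2", "i3", "i4"], 2)
responses = {
    "e1": {"i1": 1, "i3": 0},
    "e2": {"i2": 1, "i4": 1},
    "e3": {"i1": 0, "i3": 0},
    "e4": {"i2": 1, "i4": 0},
}
group_score = group_proportion_correct(responses)
```

Because each item is answered by only a subsample, any reliability or
standard error attached to the group score has to reflect both the
sampling of examinees and the sampling of items.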

Background

Test development is the process of producing
a measure of some aspect of an individual’s 
knowledge, skill, ability, interests, attitudes, 
or other characteristics by developing items 
and combining them to form a test, accord¬ 
ing to a specified plan. Test development is 
guided by the stated purpose(s) of the test 
and the intended inferences to be made from 
the test scores. The test development process 
involves consideration of content, format, the 
context in which the test will be used, and
the potential consequences of using the test. 
Test development also includes specifying 
conditions for administering the test, deter¬ 
mining procedures for scoring the test per¬ 
formance, and reporting the scores to test 
takers and test users. This chapter focuses pri¬ 
marily on the following aspects of test devel¬ 
opment: stating the purpose(s) of the test, 
defining a framework for the test, developing 
test specifications, developing and evaluating 
items and their associated scoring procedures, 
assembling the test, and revising the test. The 
first section describes the test development 
process that begins with a statement of the 
purpose(s) of the test and culminates with
the assembly of the test. The second section 
addresses several special considerations in test 
development, including considerations in 
delineating the test framework and in devel¬ 
oping performance assessments. The chapter 
concludes with a discussion on test revision.
Issues bearing on validity, reliability, and fair¬ 
ness are interwoven within the stages of test 
development. Each of these topics is addressed 
comprehensively in other chapters of the 
Standards: validity in chapter 1, reliability in
chapter 2, and aspects of fairness in chapters 
7, 8, 9, and 10. Additional material on test 
administration and scoring, and on reporting 
scores and results, is provided in chapter 5. 
Chapter 4 discusses score scales, and the focus 
of chapter 6 is test documents. 


Test Development 

The process of developing educational and psy¬ 
chological tests commonly begins with a state¬ 
ment of the purpose(s) of the test and the 
construct or content domain to be measured. 
Tests of the same construct or domain can dif¬ 
fer in important ways, because a number of 
decisions must be made as the test is developed. 
It is helpful to consider the four phases leading
from the original statement of purpose(s) to the 
final product: (a) delineation of the purpose(s) 
of the test and the scope of the construct or the 
extent of the domain to be measured; (b) devel¬ 
opment and evaluation of the test specifica¬ 
tions; (c) development, field testing, evaluation, 
and selection of the items and scoring guides 
and procedures; and (d) assembly and evalua¬ 
tion of the test for operational use. What fol¬ 
lows is a description of typical test development 
procedures, though there may be sound reasons 
that some of these steps are followed in some
settings and not in others. 

The first step is to extend the original 
statement of purpose(s), and the construct or 
content domain being considered, into a frame¬ 
work for the test that describes the extent of
the domain, or the scope of the construct to 
be measured. The test framework, therefore, 
delineates the aspects (e.g., content, skills, 
processes, and diagnostic features) of the construct
or domain to be measured. For example,
“Does eighth-grade mathematics include
algebra?” “Does verbal ability include text
comprehension as well as vocabulary?” “Does
self-esteem include both feelings and acts?”
The delineation of the test framework can be 
guided by theory or an analysis of the content 
domain or job requirements as in the case of 
many licensing and employment tests. The test 
framework serves as a guide to subsequent test 
evaluation. The chapter on validity provides a 
more thorough discussion of the relationships 
among the construct or content domain, the 
test framework, and the purpose(s) of the test.

Once decisions have been made about
what the test is to measure, and what its scores
are intended to convey, the next step is to 
design the test by establishing test specifica¬ 
tions. The test specifications delineate the for¬ 
mat of items, tasks, or questions; the response 
format or conditions for responding; and the
type of scoring procedures. The specifications
may indicate the desired psychometric prop¬ 
erties of items, such as difficulty and discrimi¬ 
nation, as well as the desired test properties 
such as test difficulty, inter-item correlations,
and reliability. The test specifications may 
also include such factors as time restrictions, 
characteristics of the intended population of 
test takers, and procedures for administration. 
All subsequent test development activities are 
guided by the test specifications. 

Test specifications will include, at least 
implicitly, an indication of whether the test
scores will be primarily norm-referenced or
criterion-referenced. When scores are norm-
referenced, relative score interpretations are of
primary interest. A score for an individual or 
for a definable group is ranked within one or 
more distributions of scores or compared to 
the average performance of test takers for var¬ 
ious reference populations (e.g., based on age, 
grade, diagnostic category, or job classifica¬ 
tion). When scores are criterion-referenced, 
absolute score interpretations are of primary 
interest. The meaning of such scores does not 
depend on rank information. Rather, the test 
score conveys directly a level of competence 
in some defined criterion domain. Both rela¬ 
tive and absolute interpretations are often 
used with a given test, but the test developer
determines which approach is most relevant 
for that test. 

The nature of the item and response for¬ 
mats that may be specified depends on the 
purposes of the test and the defined domain 
of the test. Selected-response formats, such as 
multiple-choice items, are suitable for many 
purposes of testing. The test specifications 
indicate how many alternatives are to be used
for each item. Other purposes may be more
effectively served by a short constructed-response
format. Short-answer items require a response
of no more than a few words. Extended-response
formats require the test taker to write a more 
extensive response of one or more sentences 
or paragraphs. Performance assessments often 
seek to emulate the context or conditions in 
which the intended knowledge or skills are 
actually applied. One type of performance 
assessment, for example, is the standardized 
job or work sample. A task is presented to the 
test taker in a standardized format under 
standardized conditions. Job or work samples 
might include, for example, the assessment of 
a practitioner’s ability to make an accurate diagnosis
and recommend treatment for a defined
condition, a manager’s ability to articulate goals
for an organization, or a student’s proficiency
in performing a science laboratory experiment. 

All types of items require some indication
of how to score the responses. For selected-response
items, one alternative is considered
the correct response in some testing programs.
In other testing programs, the alternatives may
be weighted differentially. For short-answer
items, a list of acceptable alternatives may
suffice; extended-response items need more
detailed rules for scoring, sometimes called
scoring rubrics. Scoring rubrics specify the crite¬ 
ria for evaluating performance and may vary in 
the degree of judgment entailed, in the number 
of score levels, and in other ways. It is common
practice for test developers to provide
scorers with examples of performances at each 
of the score levels to help clarify the criteria. 

For extended-response items, including 
performance tasks, two major types of scoring 
procedures are used: analytic and holistic. Both
of the procedures require explicit performance
criteria that reflect the test framework. However,
the approaches differ in the degree of detail 
provided in the evaluation report. Under the 
analytic scoring procedure, each critical 
dimension of the performance criteria is judged 
independently, and separate scores are obtained
for each of these dimensions in addition to
an overall score. Under the holistic scoring 
procedure, the same performance criteria may 
implicitly be considered, but only one overall 
score is provided. Because the analytic proce¬ 
dure provides information on a number of 
critical dimensions, it potentially provides valuable
information for diagnostic purposes and
lends itself to evaluating strengths and weak¬ 
nesses of test takers. In contrast, the holistic 
procedure may be preferable when an overall 
judgment is desired and when the skills being 
assessed are complex and highly interrelated. 
Regardless of the type of scoring procedure, 
designing the items and developing the scoring 
rubrics and procedures is an integrated process. 

A participatory approach may be used in 
the design of items, scoring rubrics, and sometimes
the scoring process itself. Many interested
persons (e.g., practitioners, teachers) may be 
involved in developing items and scoring rubrics, 
and/or evaluating the subsequent performan¬ 
ces. If a participatory approach is used, partici¬ 
pants' knowledge about the domain being 
assessed and their ability to apply the scoring 
rubrics are of critical importance. Equally 
important, for those involved in developing 
tests and evaluating performances, is their 
familiarity with the nature of the population 
being tested. Relevant characteristics of the 
population being tested may include the typi¬ 
cal range of expected skill levels, their famil¬ 
iarity with the response modes required of 
them, and the primary language they use. 

The test developer usually assembles an 
item pool that consists of a larger set of items 
than what is required by the test specifications. 
This allows for the test developer to select 
a set of items for the test that meet the test
specifications. The quality of the items is 
usually ascertained through item review pro¬ 
cedures and pilot testing. Items are reviewed 
for content quality, clarity, and lack of ambiguity.
Items sometimes are reviewed for sensitivity
to gender or cultural issues. An attempt
is generally made to avoid words and topics
that may offend or otherwise disturb some
test takers, if less offensive material is equally
useful. Often, a field test is developed and
administered to a group of test takers who are
somewhat representative of the target population
for the test. The field test helps determine
some of the psychometric properties of
the test items, such as an item’s difficulty and 
ability to discriminate among test takers of 
different standing on the scale. Ongoing test¬ 
ing programs often pretest items by inserting 
them into existing tests. Those items are not 
used in obtaining test scores of the test takers, 
but the item responses provide useful data for 
test development. 
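
The two item properties named above can be sketched with classical
indices: difficulty as the proportion of test takers answering
correctly, and discrimination as the correlation between the item score
and the score on the remaining items. The choice of these particular
indices, the function name, and the data are assumptions made for
illustration.

```python
from statistics import mean, pstdev


def item_statistics(responses):
    """responses: list of examinee rows of 0/1 item scores.

    Returns one (difficulty, discrimination) pair per item.
    """
    totals = [sum(row) for row in responses]
    stats = []
    for j in range(len(responses[0])):
        item = [row[j] for row in responses]
        # Score on the remaining items, so the item is not correlated with itself.
        rest = [t - x for t, x in zip(totals, item)]
        p = mean(item)
        sd_item, sd_rest = pstdev(item), pstdev(rest)
        if sd_item == 0 or sd_rest == 0:
            r = float("nan")  # correlation undefined without variation
        else:
            cov = mean([x * y for x, y in zip(item, rest)]) - p * mean(rest)
            r = cov / (sd_item * sd_rest)
        stats.append((p, r))
    return stats


# Invented field-test data: four test takers, three items.
field_test = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
item_stats = item_statistics(field_test)
```

In the invented data the items become progressively harder, and each
item correlates positively with the score on the other items.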

The next step in test development is to 
assemble items into a test or to identify an 
item pool for an adaptive test. The test devel¬ 
oper is responsible for ensuring that the items 
selected for the test meet the requirements of 
the test specifications. Depending upon the 
purpose(s) of the test, relevant considerations 
in item selection may include the content 
quality and scope, the weighting of items and 
subdomains, and the appropriateness of the 
items selected for the intended population of 
test takers. Often test developers will specify 
the distribution of psychometric indices of 
the items to be included in the test. For 
example, the specified distribution of item
difficulty indices for a selection test would 
differ from the distribution specified for a 
general achievement test. When psychometric 
indices of the items are estimated using item 
response theory (IRT), the fit of the model 
to the data is also evaluated. This is accom¬ 
plished by evaluating the extent to which the 
assumptions underlying the item response 
model (e.g., unidimensionality, local inde¬ 
pendence, speededness, and equality of slope 
parameters) are satisfied. 

The test developer is also responsible for 
ensuring that the scoring procedures are con¬ 
sistent with the purpose(s) of the test and 
facilitate meaningful score interpretation. The 
nature of the intended score interpretations
will determine the importance of psychometric
characteristics of items in the test construction 
process. For example, indices of item difficulty 
and discrimination, and inter-item correlations, 
may be particularly important when relative 
score interpretations are intended. In the case 
of relative score interpretations, good discrim¬ 
ination among test takers at all points along 
the construct continuum is desirable. It is 
important, however, that the test specifica¬ 
tions are not compromised when optimizing 
the distribution of these indices. In the case 
of absolute score interpretations, different cri¬ 
teria apply. In this case, the extent to which 
the relevant domain has been adequately rep¬ 
resented is important even if many of the 
items are relatively easy or nondiscriminating 
within a relevant population. It is important, 
however, to assure the quality of the content 
of relatively easy or nondiscriminating items. 
If cut scores are necessary for score interpreta¬ 
tion in criterion-referenced programs, the level 
of item discrimination constitutes critical 
information primarily in the vicinity of the 
cut scores. Because of these differences in test 
development procedures, tests designed to 
facilitate one type of interpretation function 
less effectively for other types of interpretation. 
Given appropriate test design and supporting
evidence, however, scores arising from some 
norm-referenced programs may provide rea¬ 
sonable absolute score interpretations and 
scores arising from some criterion-refer¬ 
enced programs may provide reasonable rela¬ 
tive score interpretations. 

When evaluating the quality of the items 
in the item pool and the test itself, test devel¬ 
opers often conduct studies of differential 
item functioning (see chapter 7). Differential 
item functioning is said to exist when test 
takers of approximately equal ability on the 
targeted construct or content domain differ 
in their responses to an item according to their 
group membership. In theory, the ultimate
goal of such studies is to identify construct-
irrelevant aspects of item content, item format,
or scoring criteria that may differentially affect
test scores of one or more groups of test takers.
When differential item functioning is
detected, test developers try to identify plausible
explanations for the differences, and then
they may replace or revise items that give rise
to group differences if construct irrelevance is
deemed likely. However, at this time, there has 
been little progress in discerning the cause or 
substantive themes that account for differen¬ 
tial item functioning on a group basis. Items 
for which the differential item functioning 
index is significant may constitute valid meas¬ 
ures of an element of the intended domain and 
differ in no way from other items that show 
nonsignificant indexes. When the differential 
item functioning index is significant, the test 
developer must take care that any replacement 
items or item revisions do not compromise 
the test specifications. 
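
The text does not name a particular index of differential item
functioning. As one commonly used possibility, the Mantel-Haenszel
procedure matches test takers on total score and pools the odds of a
correct response for two groups across score levels. The Python sketch
below assumes that procedure; the function names, the group labels, and
the counts are hypothetical.

```python
import math


def mantel_haenszel_odds_ratio(strata):
    """strata: tuples (ref_correct, ref_wrong, focal_correct, focal_wrong),
    one tuple per matched score level."""
    num = den = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        if n == 0:
            continue  # skip empty score levels
        num += a * d / n
        den += b * c / n
    return num / den  # raises ZeroDivisionError if no level is informative


def mh_d_dif(strata):
    """Common odds ratio re-expressed on a delta-type scale: -2.35 * ln(odds ratio)."""
    return -2.35 * math.log(mantel_haenszel_odds_ratio(strata))


# Invented counts at two matched score levels; the odds of success are
# identical for the two groups at each level.
no_dif = [(30, 10, 15, 5), (20, 20, 10, 10)]

# Invented counts at one score level where the reference group has higher odds.
some_dif = [(30, 10, 20, 20)]
```

An odds ratio near 1 (an index near 0) indicates little difference
between matched groups. As the text notes, a large index does not by
itself show that the item is measuring something irrelevant.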

When multiple forms of a test are pre¬ 
pared, the test specifications govern each of 
the forms. Also, when an item pool is devel¬ 
oped for a computerized adaptive test, the 
specifications refer both to the item pool and
to the rules or procedures by which the indi¬ 
vidual item sets are created for each test taker. 
Some of the attractive features of computerized
adaptive tests, such as tailoring the difficulty
level of the items to the test taker’s
ability, place additional constraints on the 
design of such tests. In general, a large number
of items is needed for a computerized
adaptive test to ensure that each tailored item
set meets the requirements of the test specifications.
Further, tests often are developed in
the context of larger systems or programs. 
Multiple item sets, for example, may be creat¬ 
ed for use with different groups of test takers 
or on different testing dates. Last, when a
short form of a test is prepared, the test specifications
of the original test govern the short
form. Differences in the test specifications
and the psychometric properties of the short
form and the original test will affect the interpretation
of the scores derived from the short
form. In any of these cases, the same fundamental
methods and principles of test development
apply.

Special Considerations in Test 
Development 

This section elaborates on several topics dis¬ 
cussed above. First, considerations in delin¬ 
eating the framework for the test are discussed. 
Following this, considerations in the develop¬ 
ment of performance assessments and portfolios 
are addressed. 

Delineating the Framework for 
the Test 

The scenario presented above outlines what is 
often done to develop a test. However, the activ¬ 
ities do not always happen in a rigid sequence. 
There is often a subtle interplay between the 
process of conceptualizing a construct or con¬ 
tent domain and the development of a test of 
that construct or domain. The framework for 
the test provides a description of how the 
construct or domain will be represented. The 
procedures used to develop items and scoring 
rubrics and to examine item characteristics 
may often contribute to clarifying the frame¬ 
work. The extent to which the framework is 
defined a priori is dependent on the testing 
application. In many testing applications, a
well-defined framework and detailed test speci¬ 
fications guide the development of items and 
their associated scoring rubrics and procedures. 
In some areas of psychological measurement, 
test development may be less dependent on 
an a priori defined framework and may rely 
more on a data-based approach that results in 
an empirically derived definition of the framework.
In such instances, items are selected
primarily on the basis of their empirical rela¬ 
tionship with an external criterion, their rela¬ 
tionships with one another, or their power to 
discriminate among groups of individuals. For 
example, construction of a selection test for
sales personnel might be guided by the correlations
of item scores with productivity measures
of current sales personnel, or a measure of
client satisfaction might be assembled from those
items in an item pool that correlate most highly
with customer loyalty. Similarly, an inventory
to help identify different patterns of psychopathology
might be developed using patients from
different diagnostic subgroups. When test 
development relies on a data-based approach, 
it is likely that some items will be selected based 
on chance occurrences in the data. Cross-validation
studies, which involve administering the test to a
comparable sample, are routinely conducted to
determine the tendency to select items by chance.

In many testing applications, the frame¬ 
work for the test is specified initially and this 
specification subsequently guides the develop¬ 
ment of items and scoring procedures. Empirical 
relationships may then be used to inform 
decisions about retaining, rejecting, or modi¬ 
fying items. Interpretations of scores from tests 
developed by this process have the advantage 
of a logical/theoretical and an empirical foun¬ 
dation for the underlying dimensions repre¬ 
sented by the test. 

Performance Assessments 

One distinction between performance 
assessments and other forms of tests has to do 
with the type of response that is required from 
the test takers. Performance assessments require 
the test takers to carry out a process such as 
playing a musical instrument or tuning a car’s 
engine or to produce a product such as a writ¬ 
ten essay. Performance assessments generally 
require the test takers to demonstrate their 
abilities or skills in settings that closely resem¬ 
ble real-life settings. For example, an assess¬ 
ment of a psychologist in training may require 
the test taker to interview a client, choose 
appropriate tests, and arrive at diagnosis and 
plan for therapy. Performance assessments are
diverse in nature and can be product-based as 
well as behavior-based. Because performance 
assessments typically consist of a small number
of tasks, establishing the extent to which
the results can be generalized to the broader 
domain is particularly important. The use of 
test specifications will contribute to tasks being
developed so as to systematically represent the 
critical dimensions to be assessed, leading to a 
more comprehensive coverage of the domain 
than what would occur if test specifications were 
not used. Further, both logical and empirical 
evidence are important to document the extent 
to which performance assessments—tasks as 
well as scoring criteria—reflect the processes 
or skills that are specified by the domain 
definition. When tasks are designed to elicit
complex cognitive processes, logical analyses 
of the tasks and both logical and empirical 
analyses of the test takers' performances on 
the tasks provide necessary validity evidence. 

Portfolios 

A unique type of performance assessment is an 
individual portfolio. Portfolios are systematic 
collections of work or educational products 
typically collected over time. Like other assess¬ 
ment procedures, the design of portfolios is 
dependent on the purpose. Typical purposes 
include judgment of the improvement in job 
or educational performance and evaluation of 
the eligibility for employment, promotion, or 
graduation. A well-designed portfolio specifies 
the nature of the work that is to be put into the 
portfolio. The portfolio may include entries such
as representative products, the best work of the 
test taker, or indicators of progress. For example, 
in an employment setting involving promotion,
employees may be instructed to include their
best work or products. Alternatively, if the pur¬ 
pose is to judge a student’s educational growth, 
students may be asked to provide evidence of 
improvement with respect to particular com¬ 
petencies or skills. They may also be requested 
to provide justifications for the choices. Still other
methods may include the use of videotapes, exhi¬
bitions, demonstrations, simulations, and so on.

In employment settings, employees may be 
involved in the selection of their work and prod¬ 


ucts that demonstrate their competencies for 
promotion purposes. Analogously, in educa¬ 
tional applications, students may participate in 
the selection of some of their work and the prod¬
ucts to be included in their portfolios as well as
in the evaluation of the materials. The specifi¬
cations for the portfolio indicate who is respon¬
sible for selecting its contents. For example, the
specifications may state that the test taker, the 
examiner, or both parties working together should
be involved in the selection of the contents of the
portfolio. The particular responsibilities of each 
party are delineated in the specifications. The 
more standardized the contents and procedures 
of administration, the easier it is to establish 
comparability of portfolio-based scores. 
Regardless of the methods used, all performance 
assessments are evaluated by the same standards 
of technical quality as other forms of tests. 

Test Revisions 

Tests and their supporting documents (e.g., test 
manuals, technical manuals, user's guides) are 
reviewed periodically to determine whether 
revisions are needed. Revisions or amendments 
are necessary when new research data, significant 
changes in the domain, or new conditions of 
test use and interpretation would either improve 
the validity of interpretations of the test scores 
or suggest that the test is no longer fully appro¬
priate for its intended use. As an example, tests
are revised if the test content or language has be¬ 
come outdated and, therefore, may subsequently 
affect the validity of the test score interpretations. 
Revisions to test content are also made to ensure 
the confidentiality of the test. It should be noted, 
however, that outdated norms may not have the 
same implications for revisions as an outdated test.
For example, it may be necessary to update the 
norms for an achievement test after a period of
rising or falling achievement in the norming 
population, or when there are changes in the 
test-taking population, but the test content 
itself may continue to be as relevant as it was 
when the test was developed.

Standard 3.1

Tests and testing programs should be devel¬ 
oped on a sound scientific basis. Test devel¬ 
opers and publishers should compile and 
document adequate evidence bearing on 
test development. 

Standard 3.2 

The purpose(s) of the test, definition of the 
domain, and the test specifications should 
be stated clearly so that judgments can be 
made about the appropriateness of the 
defined domain for the stated purpose(s) 
of the test and about the relation of items 
to the dimensions of the domain they are 
intended to represent. 

Comment: The adequacy and usefulness of 
test interpretations depend on the rigor with 
which the purposes of the test and the domain 
represented by the test have been defined and
explicated. The domain definition should be 
sufficiently detailed and delimited to show 
clearly what dimensions of knowledge, skill,
processes, attitude, values, emotions, or 
behavior are included and what dimensions 
are excluded. A clear description will enhance 
accurate judgments by reviewers and others 
about the congruence of the defined domain 
and the test items.

Standard 3.3 

The test specifications should be document¬ 
ed, along with their rationale and the 
process by which they were developed. The 
test specifications should define the content 
of the test, the proposed number of items, 
the item formats, the desired psychometric 
properties of the items, and the item and 
section arrangement. They should also speci¬ 
fy the amount of time for testing, directions 
to the test takers, procedures to be used for 
test administration and scoring, and other 
relevant information. 


Comment: Professional judgment plays a major 
role in developing the test specifications. The
specific procedures used for developing the 
specifications depend on the purposes of the 
test. For example, in developing licensure and 
certification tests, practice analyses or job analy¬ 
ses usually provide the basis for defining the 
test specifications, and job analyses primarily 
serve this function for employment tests. For 
achievement tests to be given at the end of a 
course, the test specifications should be based 
on an outline of course content and goals,
whereas for placement tests it may be nec¬
essary to examine the required entry knowl¬
edge and skills for several courses.

Standard 3.4 

The procedures used to interpret test scores, 
and, when appropriate, the normative or 
standardization samples or the criterion used 
should be documented. 

Comment: Test specifications may indicate that 
the intended score interpretations are for absolute 
or relative score interpretations, or both. In rel¬ 
ative score interpretations the status of an indi¬ 
vidual (or group) is determined by comparing 
the score (or mean score) to the performance of 
others in one or more defined populations. In 
absolute score interpretations, the score or aver¬ 
age is assumed to reflect directly a level of com¬ 
petence or mastery in some defined criterion 
domain. Tests designed to facilitate one type of 
interpretation function less effectively for other 
types of interpretations. Given appropriate test
design and adequate supporting data, however,
scores arising from norm-referenced testing pro¬
grams may provide reasonable absolute score
interpretations and scores arising from criterion-
referenced programs may provide reasonable
relative score interpretations.

Standard 3.5 

When appropriate, relevant experts external 
to the testing program should review the test 
specifications. The purpose of the review, the
process by which the review is conducted, 
and the results of the review should be docu¬ 
mented. The qualifications, relevant experi¬ 
ences, and demographic characteristics of 
expert judges should also be documented. 

Comment: Expert review of the test specifica¬ 
tions may serve many useful purposes such as 
helping to assure content quality and repre¬ 
sentativeness. The expert judges may include 
individuals representing defined populations 
of concern to the test specifications. For exam¬ 
ple, if the test is related to ethnic minority 
concerns, the expert review typically includes 
members of appropriate ethnic minority 
groups or experts on minority group issues. 

Standard 3.6 

The type of items, the response formats, scor¬
ing procedures, and test administration proce¬
dures should be selected based on the purposes
of the test, the domain to be measured, and 
the intended test takers. To the extent possible, 
test content should be chosen to ensure that 
intended inferences from test scores are equally 
valid for members of different groups of test 
takers. The test review process should include 
empirical analyses and, when appropriate, the 
use of expert judges to review items and 
response formats. The qualifications, relevant 
experiences, and demographic characteristics 
of expert judges should also be documented. 

Comment: Expert judges may be asked to iden¬ 
tify material likely to be inappropriate, confus¬ 
ing, or offensive for groups in the test-taking 
population. For example, judges may be asked 
to identify whether lack of exposure to problem 
contexts in mathematics word problems may 
be of concern for some groups of students. 
Various groups of test takers can be defined by 
characteristics such as age, ethnicity, culture, 
gender, disability, or demographic region. 
There is limited evidence, however, that expert 
reviews alleviate problems with bias in testing 
(see chapter 7). 

Standard 3.7 

The procedures used to develop, review, and 
try out items, and to select items from the 
item pool should be documented. If the 
items were classified into different categories 
or subtests according to the test specifica¬ 
tions, the procedures used for the classifica¬ 
tion and the appropriateness and accuracy 
of the classification should be documented. 

Comment: Empirical evidence and/or expert 
judgment are used to classify items according 
to categories of the test specifications. For
example, professional panels may be used for 
classifying the items or for determining the 
appropriateness of the developer's classifica¬ 
tion scheme. The panel and procedures used 
should be chosen with care as they will affect 
the accuracy of the classification. 

Standard 3.8 

When item tryouts or field tests are con¬ 
ducted, the procedures used to select the 
sample(s) of test takers for item tryouts and 
the resulting characteristics of the sample(s) 
should be documented. When appropriate, 
the sample(s) should be as representative as 
possible of the population(s) for which the 
test is intended. 

Comment: Conditions which may differential¬ 
ly affect performance on the test items by the 
sample(s) as compared to the intended popu¬
lation(s) should be documented when appro¬
priate. As an example, test takers may be less
motivated when they know their scores will 
not have an impact on them. 

Standard 3.9 

When a test developer evaluates the psycho¬ 
metric properties of items, the classical or 
item response theory (IRT) model used for 
evaluating the psychometric properties of 
items should be documented. The sample used 
for estimating item properties should be de¬
scribed and should be of adequate size and diver¬
sity for the procedure. The process by which
items are selected and the data used for item 
selection, such as item difficulty, item discrimi¬ 
nation, and/or item information, should also 
be documented. When IRT is used to estimate
item parameters in test development, the item 
response model, estimation procedures, and 
evidence of model fit should be documented. 
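
As a hedged illustration of the classical item statistics named in this
standard, the sketch below computes item difficulty as the proportion
of correct responses (so higher values mean easier items) and item
discrimination as the correlation between an item and the total of the
remaining items. The response matrix and the function names are
hypothetical, and the item-rest correlation is only one of several
discrimination indices a developer might document.

```python
def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # undefined for a constant column; report 0 instead
    return sxy / (sxx * syy) ** 0.5


def item_statistics(responses):
    """Classical difficulty and item-rest discrimination per item.

    responses holds one list of 0/1 item scores per test taker.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    totals = [sum(person) for person in responses]
    stats = []
    for j in range(n_items):
        item = [person[j] for person in responses]
        # Exclude the item itself so it does not inflate its own correlation.
        rest = [total - score for total, score in zip(totals, item)]
        stats.append({
            "difficulty": sum(item) / n_people,
            "discrimination": pearson(item, rest),
        })
    return stats


# Hypothetical responses: rows are test takers, columns are items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]

stats = item_statistics(responses)
for j, s in enumerate(stats):
    print(j, s["difficulty"], round(s["discrimination"], 3))
```

A sample of four test takers is far too small for real use; the point
of the standard is precisely that the sample behind such statistics
should be described and be of adequate size and diversity.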

Comment: Although overall sample size is
important, it is important also that there be an 
adequate number of cases in regions critical to 
the determination of the psychometric proper¬ 
ties of items. If the test is to achieve greatest 
precision in a particular part of the score scale 
and this consideration affects item selection, 
the manner in which item statistics are used 
needs to be carefully described. When IRT is
used as the basis of test development, it is 
important to document the adequacy of fit of 
the model to the data. This is accomplished by 
providing information about the extent to 
which IRT assumptions (e.g., unidimensionali¬
ty, local item independence, or equality of slope
parameters) are satisfied. 

Test developers should show that any dif¬ 
ferences between the administration conditions 
of the field test and the final form do not affect 
item performance. Conditions that can affect 
item statistics include item position, time 
limits, length of test, mode of testing (e.g., 
paper-and-pencil versus computer-administered), 
and use of calculators or other tools. For exam¬ 
ple, in field testing items, those placed at the 
end of a test might obtain poorer item statis¬
tics than those inserted within the test.

Standard 3.10 

Test developers should conduct cross-valida¬ 
tion studies when items are selected primari¬ 
ly on the basis of empirical relationships 
rather than on the basis of content or theoreti¬ 
cal considerations. The extent to which the dif¬ 
ferent studies identify the same item set should 
be documented. 


Comment: When data-based approaches to test 
development are used, items are selected prima¬ 
rily on the basis of their empirical relationships 
with an external criterion, their relationships 
with one another, or their power to discrimi¬ 
nate among groups of individuals. Under these 
circumstances, it is likely that some items will 
be selected based on chance occurrences in the 
data used. Administering the test to a compara¬ 
ble sample of test takers or a hold-out sample 
provides a means by which the tendency to 
select items by chance can be determined.
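
One way to document the extent to which different studies identify the
same item set is to report the shared items together with a simple
overlap index. The sketch below uses the Jaccard index (shared items
divided by all items selected in either study); that choice, the item
numbers, and the function name are assumptions made for illustration
and are not prescribed by the Standards.

```python
def item_set_overlap(selected_a, selected_b):
    """Summarize how far two studies agree on the selected items."""
    a, b = set(selected_a), set(selected_b)
    shared = a & b
    union = a | b
    return {
        "shared_items": sorted(shared),
        "only_first": sorted(a - b),
        "only_second": sorted(b - a),
        # Jaccard index: 1.0 means identical sets, 0.0 means no overlap.
        "jaccard": len(shared) / len(union) if union else 1.0,
    }


# Hypothetical item numbers selected in two comparable samples.
study_1 = [3, 7, 12, 18, 25]
study_2 = [3, 7, 12, 20, 31]

overlap = item_set_overlap(study_1, study_2)
print(overlap)
```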

Standard 3.11 

Test developers should document the extent to 
which the content domain of a test represents 
the defined domain and test specifications. 

Comment: Test developers should provide evi¬ 
dence of the extent to which the test items and 
scoring criteria represent the defined domain. This 
affords a basis to help determine whether per¬ 
formance on the test can be generalized to the 
domain that is being assessed. This is especially 
important for tests that contain a small number
of items such as performance assessments. Such 
evidence may be provided by expert judges. 

Standard 3.12 

The rationale and supporting evidence for 
computerized adaptive tests should be docu¬ 
mented. This documentation should include 
procedures used in selecting subsets of items 
for administration, in determining the start¬ 
ing point and termination conditions for the 
test, in scoring the test, and for controlling 
item exposure. 

Comment: It is important to assure that docu¬
mentation of the procedures does not com¬
promise the security of the test items.

If a computerized adaptive test is intended 
to measure a number of different content sub¬
categories, item selection procedures are to assure
that the subcategories are adequately represented
by the items presented to the test taker.

Standard 3.13

When a test score is derived from the differen¬ 
tial weighting of items, the test developer 
should document the rationale and process used 
to develop, review, and assign item weights. 
When the item weights are obtained based on 
empirical data, the sample used for obtaining 
item weights should be sufficiently large and 
representative of the population for which the 
test is intended. When the item weights are 
obtained based on expert judgment, the quali¬
fications of the judges should be documented.

Comment: Changes in the population of test 
takers, along with other changes such as changes 
in instructions, training, or job requirements,
may impact the original derived item weights, 
necessitating subsequent studies after an 
appropriate period of time. 

Standard 3.14 

The criteria used for scoring test takers' per¬ 
formance on extended-response items should be 
documented. This documentation is especially 
important for performance assessments, such as 
scorable portfolios and essays, where the criteria
for scoring may not be obvious to the user. 

Comment: The completeness and clarity of the 
test specifications, including the definition of the 
domain, are essential in developing the scoring 
criteria. The test developer needs to provide a 
clear description of how the test scores are
intended to be interpreted to help ensure the 
appropriateness of the scoring procedures. 

Standard 3.15 

When using a standardized testing format to 
collect structured behavior samples, the domain, 
test design, test specifications, and materials 
should be documented as for any other test. 
Such documentation should include a clear 
definition of the behavior expected of the test 
takers, the nature of the expected responses, and 
any materials or directions that are necessary 
to carry out the testing. 


Comment: In developing a prompt, the age, lan¬
guage, experience, and ability level of test takers
should be considered, as should other possible
unique sources of difficulty for groups in the po¬
pulation to be tested. Test directions that specify
time allowances, nature of the responses expect¬
ed, and rules regarding use of supplementary
materials, such as notes, references, dictionaries, 
calculators, or manipulatives such as lab equip¬ 
ment, may be established via field testing. 

Standard 3.16 

If a short form of a test is prepared, for exam¬ 
ple, by reducing the number of items on the 
original test or organizing portions of a test into 
a separate form, the specifications of the short 
form should be as similar as possible to those 
of the original test. The procedures used for 
the reduction of items should be documented. 

Comment: The extent to which the specifica¬ 
tions of the short form differ from those of 
the original test, and the implications of such 
differences for interpreting the scores derived 
from the short form, should be documented. 

Standard 3.17 

When previous research indicates that irrele¬ 
vant variance could confound the domain def¬ 
inition underlying the test, then to the extent 
feasible, the test developer should investigate 
sources of irrelevant variance. Where possible, 
such sources of irrelevant variance should be
removed or reduced by the test developer. 

Standard 3.18 

For tests that have time limits, test development 
research should examine the degree to which 
scores include a speed component and evaluate 
the appropriateness of that component, given 
the domain the test is designed to measure. 

Standard 3.19 

The directions for test administration should 
be presented with sufficient clarity and empha¬
sis so that it is possible for others to replicate
adequately the administration conditions under 
which the data on reliability and validity, and, 
where appropriate, norms were obtained. 

Comment: Because all people administering 
tests, including those in schools, industry, and 
clinics, need to follow test administration con¬ 
ditions carefully, it is essential that test admin¬ 
istrators receive detailed instructions on test 
administration guidelines and procedures. 

Standard 3.20 

The instructions presented to test takers should 
contain sufficient detail so that test takers can 
respond to a task in the manner that the test 
developer intended. When appropriate, sample 
material, practice or sample questions, criteria 
for scoring, and a representative item identi¬
fied with each major area in the test's classifi¬
cation or domain should be provided to the
test takers prior to the administration of the 
test or included in the testing material as part 
of the standard administration instructions. 

Comment: For example, in a personality 
inventory it may be intended that test takers
give the first response that occurs to them. 
Such an expectation should be made clear in 
the inventory directions. As another example, 
in directions for interest or occupational 
inventories, it may be important to specify
whether test takers are to mark the activities 
they would like ideally or whether they are 
to consider both their opportunity and their 
ability realistically. 

The extent and nature of practice materi¬
als and directions depend on expected levels
of knowledge among test takers. For example, 
in using a novel test format, it may be very 
important to provide the test taker a practice 
opportunity as part of the test administration. 
In some testing situations, it may be important
for the instructions to address such matters as 
the effects that guessing and time limits have 
on test scores. If expansion or elaboration of 
the test instructions is permitted, the condi¬ 


tions under which this may be done should be 
stated clearly in the form of general rules and 
by giving representative examples. If no expan¬ 
sion or elaboration is to be permitted, this 
should be stated explicitly. Publishers should 
include guidance for dealing with typical 
questions from test takers. Users should be 
instructed how to deal with questions that 
may arise during the testing period. 

Standard 3.21 

If the test developer indicates that the condi¬ 
tions of administration are permitted to vary 
from one test taker or group to another, per¬ 
missible variation in conditions for adminis¬ 
tration should be identified, and a rationale 
for permitting the different conditions should 
be documented. 

Comment: In deciding whether the conditions 
of administration can vary, the test developer 
needs to consider and study the potential 
effects of varying conditions of administra¬ 
tion. If conditions of administration vary 
from the conditions studied by the test devel¬ 
oper or from those used in the development 
of norms, the comparability of the test scores 
may be weakened and the applicability of the 
norms can be questioned. 

Standard 3.22 

Procedures for scoring and, if relevant, 
scoring criteria should be presented by 
the test developer in sufficient detail and 
clarity to maximize the accuracy of scoring. 
Instructions for using rating scales or for 
deriving scores obtained by coding, scaling, 
or classifying constructed responses should 
be clear. This is especially critical if tests
can be scored locally. 

Standard 3.23 

The process for selecting, training, and qualify¬ 
ing scorers should be documented by the test 
developer. The training materials, such as the
scoring rubrics and examples of test takers’
responses that illustrate the levels on the score 
scale, and the procedures for training scorers 
should result in a degree of agreement among 
scorers that allows for the scores to be interpret¬
ed as originally intended by the test developer.
Scorer reliability and potential drift over time 
in raters' scoring standards should be evaluat¬ 
ed and reported by the person(s) responsible 
for conducting the training session. 

Standard 3.24 

When scoring is done locally and requires 
scorer judgment, the test user is responsible 
for providing adequate training and instruc¬ 
tion to the scorers and for examining scorer 
agreement and accuracy. The test developer 
should document the expected level of scorer 
agreement and accuracy. 

Comment: A common practice of test devel¬
opers is to provide examples of training mate¬
rials (e.g., scoring rubrics, test takers' responses
at each score level) and procedures when scoring 
is done locally and requires scorer judgment. 
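
Scorer agreement of the kind this standard asks test users to examine
can be summarized in more than one way. The sketch below shows two
common indices, the proportion of exact agreement and Cohen's kappa
(agreement corrected for chance); the ratings and function names are
hypothetical, and the Standards do not prescribe a particular index.

```python
from collections import Counter


def exact_agreement(ratings_1, ratings_2):
    """Proportion of responses given the same score by both scorers."""
    matches = sum(a == b for a, b in zip(ratings_1, ratings_2))
    return matches / len(ratings_1)


def cohens_kappa(ratings_1, ratings_2):
    """Cohen's kappa: observed agreement corrected for chance agreement."""
    n = len(ratings_1)
    p_observed = exact_agreement(ratings_1, ratings_2)
    counts_1, counts_2 = Counter(ratings_1), Counter(ratings_2)
    categories = set(counts_1) | set(counts_2)
    # Chance agreement from each scorer's marginal score distribution.
    p_chance = sum(counts_1[c] * counts_2[c] for c in categories) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both scorers used a single, identical category
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical rubric scores (1-4) given by two scorers to eight essays.
scorer_a = [1, 2, 2, 3, 3, 3, 4, 4]
scorer_b = [1, 2, 3, 3, 3, 3, 4, 2]

print("exact agreement:", exact_agreement(scorer_a, scorer_b))
print("kappa:", round(cohens_kappa(scorer_a, scorer_b), 3))
```

Kappa treats every disagreement alike; when scores are ordered, as on
most rubrics, a weighted index may be more informative, which is one
reason the expected level of agreement should be documented along with
the index used.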

Standard 3.25 

A test should be amended or revised when 
new research data, significant changes in the 
domain represented, or newly recommended 
conditions of test use may lower the validity
of test score interpretations. Although a test 
that remains useful need not be withdrawn 
or revised simply because of the passage of 
time, test developers and test publishers are 
responsible for monitoring changing condi¬ 
tions and for amending, revising, or with¬ 
drawing the test as indicated. 

Comment: Test developers need to consider a 
number of factors that may warrant the revi¬
sion of a test, including outdated test content
and language. If an older version of a test is 
used when a newer version has been published 
or made available, test users are responsible for


providing evidence that the older version is 
as appropriate as the new version for that 
particular test use. 

Standard 3.26 

Tests should be labeled or advertised as 
"revised” only when they have been revised 
in significant ways. A phrase such as “with 
minor modification” should be used when 
the test has been modified in minor ways. 
The score scale should be adjusted to account 
for these modifications, and users should be 
informed of the adjustments made to the 
score scale. 

Comment: It is the test developer's responsi¬
bility to determine whether revisions to a test
would influence test score interpretations. If
test score interpretations would be affected 
by the revisions, it would then be appropriate 
to label the test "revised.” When tests are 
revised, the nature of the revisions and their 
implications on test score interpretations 
should be documented. 

Standard 3.27 

If a test or part of a test is intended for 
research use only and is not distributed for 
operational use, statements to this effect 
should be displayed prominently on all rele¬ 
vant test administration and interpretation 
materials that are provided to the test user. 

Comment: This standard refers to tests that 
are intended for research use only and does
not refer to standard test development func¬ 
tions that occur prior to the operational use 
of a test (e.g., field testing).

4. SCALES, NORMS, AND SCORE
COMPARABILITY 


Background 

Test scores are reported on scales designed to 
assist score interpretation. Typically, scoring 
begins with responses to separate test items, 
which are often coded using 0 or 1 to represent 
wrong/right or negative/positive, but sometimes 
using numerical values to indicate finer response 
gradations. Then the item scores are combined, 
often by addition but sometimes by a more 
elaborate procedure, to obtain a raw score. Raw 
scores are determined, in part, by features of a 
test such as test length, choice of time limit, 
item difficulties, and the circumstances under 
which the test is administered. This makes raw 
scores difficult to interpret in the absence of 
further information. Interpretation and statisti¬ 
cal analyses may be facilitated by converting 
raw scores into an entirely different set of val¬ 
ues called derived scores or scale scores. The vari¬ 
ous scales used for reporting scores on college 
admissions tests, the standard scores often 
used to report results for intelligence scales or 
vocational interest and personality inventories, 
and the grade equivalents reported for achieve¬
ment tests in the elementary grades are exam¬
ples of scale scores. The process of developing
such a score scale is called scaling a test. Scale 
scores may aid interpretation by indicating 
how a given score compares to those of other 
test takers, by enhancing the comparability of 
scores obtained using different forms of a test, 
or in other ways. 
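
As one hedged illustration of converting raw scores to derived scores,
the sketch below applies a linear transformation that gives the scale a
chosen mean and standard deviation. The target values of 50 and 10, the
raw scores, and the function name are assumptions for the example;
operational scales are often built with more elaborate procedures.

```python
from statistics import mean, pstdev


def linear_scale_scores(raw_scores, target_mean=50.0, target_sd=10.0):
    """Linearly rescale raw scores to a chosen mean and standard deviation."""
    raw_mean = mean(raw_scores)
    raw_sd = pstdev(raw_scores)  # standard deviation of this reference group
    if raw_sd == 0:
        raise ValueError("raw scores do not vary; no scale can be defined")
    return [target_mean + target_sd * (x - raw_mean) / raw_sd
            for x in raw_scores]


# Hypothetical raw scores from a small reference group.
raw = [10, 12, 14, 16, 18]
scaled = linear_scale_scores(raw)
print([round(s, 1) for s in scaled])
```

A linear conversion preserves the order of test takers and the shape of
the score distribution; it changes only the units, so a scale score
immediately shows how far a test taker stands from the reference-group
average.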

Another way of assisting score interpreta¬ 
tion is to establish standards or cut scores that 
distinguish different score ranges. In some 
cases, a single cut score may define the bound¬ 
ary between passing and failing. In other cases, 
a series of cut scores may define distinct pro¬ 
ficiency levels. Cut scores may be established 
for either raw or scale scores. Both scale scores 
and standards or cut scores can be central to 
the use and interpretation of test scores. For 


that reason, their defensibility is an important
consideration in test validation. There is a close
connection between standards or cut scores
and certain scale scores. If the successive score
ranges defined by a series of cut scores are
relabeled, say 0, 1, 2, and so on, then a scale
score has been created. 
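
The relabeling just described can be sketched directly: each score is
replaced by the number of cut scores it has reached. The cut scores
below are hypothetical, and the example assumes that a score exactly
equal to a cut score falls in the higher range.

```python
from bisect import bisect_right


def proficiency_level(score, cut_scores):
    """Label a score 0, 1, 2, ... by the score range it falls in.

    cut_scores must be sorted in increasing order. A score equal to a
    cut score is placed in the higher range (an assumption made here).
    """
    return bisect_right(cut_scores, score)


# Hypothetical cut scores defining four successive score ranges.
cuts = [20, 35, 50]

for s in (19, 20, 34, 35, 50, 60):
    print(s, "->", proficiency_level(s, cuts))
```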

In addition to facilitating interpretations 
of a single test form considered in isolation, 
scale scores are often created to enhance com¬ 
parability across different forms of the same 
test, across test formats or administration 
conditions, or even across tests designed to 
measure different constructs (e.g., related sub¬ 
tests in a battery). Equated scores from alter¬ 
nate forms of a test can often be interpreted 
more easily when expressed in scale score units 
rather than raw score units. Scaling may be 
used to place scores from different levels of an 
achievement test on a continuous scale and 
thereby facilitate inferences about growth or 
development. Scaling can also enhance the 
comparability of scores derived from tests in 
different areas, as in subtests within an apti¬ 
tude, interest, or achievement battery. 

Norm-Referenced and Criterion- 
Referenced Score Interpretations 

Individual raw scores or scale scores are often 
referred to the distribution of scores for one 
or more comparison groups to draw useful 
inferences about an individual's performance. 
Test score interpretations based on such compar¬
isons are said to be norm-referenced. Percentile
rank norms, for example, indicate the stand¬
ing of an individual or group within a defined
population of individuals or groups. An example 
of such a comparison group might be fourth- 
grade students in the United States, tested in 
the last 2 months of a recent school year. 
Percentiles, averages, or other statistics for such 
reference groups are called norms. By showing
how the test score of a given examinee com¬
pares to those of others, norms assist in the
classification or description of examinees. 
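
As a rough illustration of a percentile rank norm, the following Python sketch compares one score to a small reference group. The reference scores are invented, and the mid-rank convention used here (scores below, plus half of the scores tied with the examinee) is one common definition among several, not one prescribed by the text.

```python
def percentile_rank(score, reference_scores):
    """Percentile rank of score within reference_scores (mid-rank convention)."""
    below = sum(1 for s in reference_scores if s < score)
    equal = sum(1 for s in reference_scores if s == score)
    return 100.0 * (below + 0.5 * equal) / len(reference_scores)

# Hypothetical reference group of ten examinees.
reference = [31, 35, 35, 40, 42, 44, 47, 50, 53, 58]
print(percentile_rank(44, reference))  # 55.0
```

In practice a norm table would rest on a much larger, carefully sampled reference group; the calculation itself is the same.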

Other test score interpretations make no 
direct reference to the performance of other 
examinees. These interpretations may take a 
variety of forms; most are collectively referred 
to as criterion-referenced interpretations. Derived 
scores supporting such interpretations may
indicate the likely proportion of correct 
responses on some larger domain of items, or 
the probability of an examinee's answering 
particular sorts of items correctly. Other crite¬ 
rion-referenced interpretations may indicate 
the likelihood that some psychopathology is 
present. Still other criterion-referenced inter¬ 
pretations indicate the probability that an 
examinee's level of tested knowledge or skill
is adequate to perform successfully in some 
other setting; such probabilities may be sum¬ 
marized in an expectancy table. Scale scores 
to support such criterion-referenced score 
interpretations are often developed on the 
basis of statistical analyses of the relationships 
of test scores to other variables. 

Some scale scores are developed primarily 
to support norm-referenced interpretations 
and others, criterion-referenced interpretations. 
In practice, however, there is not always a sharp 
distinction. Both criterion-referenced and
norm-referenced scales may be developed and 
used for the same test scores. Moreover, a 
norm-referenced score scale originally devel¬ 
oped, for example, to indicate performance 
relative to some specific reference population 
might, over time, also come to support crite¬ 
rion-referenced interpretations. This could 
happen as research and experience brought 
increased understanding of the capabilities 
implied by different scale score levels. 
Conversely, results of an educational assess¬ 
ment might be reported on a scale consisting 
of several ordered proficiency levels, defined 
by descriptions of the kinds of tasks students
at each level were able to perform. That would
be a criterion-referenced scale, but once the
distribution of scores over levels was reported, 
say, for all eighth-grade students in a given 
state, individual students’ scores would also 
convey information about their standing rela¬ 
tive to that tested population. 

Interpretations based on cut scores may 
likewise be either criterion-referenced or 
norm-referenced. If qualitatively different 
descriptions are attached to successive score 
ranges, a criterion-referenced interpretation is 
supported. For example, the descriptions of 
performance levels in some assessment task 
scoring rubrics can enhance score interpreta¬ 
tion by summarizing the capabilities that must 
be demonstrated to merit a given score. In 
other cases, criterion-referenced interpretations 
may be based on empirically determined rela¬ 
tionships between test scores and other vari¬ 
ables. But when tests are used for selection, it 
may be appropriate to rank-order examinees 
according to their test performance and estab¬ 
lish a cut score so as to select a prespecified 
number or proportion of examinees from one 
end of the distribution, if the selection use is 
otherwise supported by relevant reliability 
and validity evidence. In such cases, the cut 
score interpretation is norm-referenced; the
labels reject or fail versus accept or pass are 
determined solely by an examinee’s standing 
relative to others tested. 

Criterion-referenced interpretations based 
on cut scores are sometimes criticized on the 
grounds that there is very rarely a sharp distinction
of any kind between those just below
versus just above a cut score. A neuropsychological
test may be helpful in diagnosing
some particular impairment, for example, but 
the probability that the impairment is pres¬ 
ent is likely to increase continuously as a 
function of the test score. Cut scores may
nonetheless aid in formulating rules for 
reaching decisions on the basis of test per¬ 
formance. It should be recognized, however, 
that the probability of misclassification will 
generally be relatively high for persons with 
scores close to the cut points.

The validity of norm-referenced interpretations
depends in part on the appropriateness of the 
reference group to which test scores are com¬ 
pared. Norms based on hospitalized patients, 
for example, might be inappropriate for some 
interpretations of nonhospitalized patients’
scores. Thus, it is important that reference
populations be carefully defined and clearly
described. Validity of such interpretations also 
depends on the accuracy with which norms 
summarize the performance of the reference 
population. That population may be small 
enough that essentially the entire population 
can be tested (e.g., all pupils at a given grade 
level in a given district tested on the same 
occasion). Often, however, only a sample of 
examinees from the reference population is 
tested. It is then important that the norms be 
based on a technically sound, representative, 
scientific sample of sufficient size. Patients in 
a few hospitals in a small geographic region 
are unlikely to be representative of all patients 
in the United States, for example. Moreover,
the appropriateness of norms based on a given 
sample may diminish over time. Thus, for tests 
that have been in use for a number of years, 
periodic review is generally required to assure 
the continued utility of norms. Renorming may 
be required to maintain the validity of norm- 
referenced test score interpretations. 

More than one reference population may 
be appropriate for the same test. For example, 
achievement test performance might be inter¬ 
preted by reference to local norms based on 
sampling from a particular school district, 
norms for a state or type of community, or 
national norms. For other tests, norms might 
be based on occupational or educational clas¬ 
sifications. Descriptive statistics for all exam¬ 
inees who happen to be tested during a given 
period of time (sometimes called user norms 
or program norms) may be useful for some
purposes, such as describing trends over time.
But there must be sound reason to regard that
group of test takers as an appropriate basis for 
such inferences. When there is a suitable ration¬ 
ale for using such a group, the descriptive sta¬ 
tistics should be clearly characterized as being 
based on a sample of persons routinely tested 
as part of an ongoing program. 

Comparability and Equating 

Many test uses involve different versions of 
the same test, which yield scores that can be 
used interchangeably even though they are 
based on different sets of items. In testing 
programs that offer a choice of examination 
dates, for example, test security may be com¬ 
promised if the same form is used repeatedly. 
Other testing applications may entail repeated 
measurements of the same individuals, perhaps 
to measure change in levels of psychological 
dysfunction, change in attitudes, or educa¬ 
tional progress. In such contexts, reuse of the 
same set of test items may result in correlated 
errors of measurement and biased estimates 
of change. When distinct forms of a test are 
constructed to the same explicit content and 
statistical specifications and administered 
under identical conditions, they are referred 
to as alternate forms or sometimes parallel or 
equivalent forms. The process of placing scores 
from such alternate forms on a common scale 
is called equating. Equating is analogous to 
the calibration of different balances so that 
they all indicate the same weight for any given 
object. However, the equating process for test 
scores is more complex. It involves small statis¬ 
tical adjustments to account for minor differ¬ 
ences in the difficulty and statistical properties 
of the alternate forms. 
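
To make the idea of a statistical adjustment concrete, the sketch below applies linear equating, one simple classical method, to invented scores. It assumes the two forms were taken by randomly equivalent groups, so that differences in mean and standard deviation can be attributed to the forms. It is offered as an illustration only and is not the procedure of any particular testing program.

```python
from statistics import mean, pstdev

def linear_equate(x, form_x_scores, form_y_scores):
    """Convert a Form X score to the Form Y scale by matching means and SDs."""
    mx, my = mean(form_x_scores), mean(form_y_scores)
    sx, sy = pstdev(form_x_scores), pstdev(form_y_scores)
    return my + (sy / sx) * (x - mx)

# Hypothetical score distributions from two randomly equivalent groups.
form_x = [10, 12, 14, 16, 18]
form_y = [13, 16, 19, 22, 25]
print(linear_equate(14, form_x, form_y))  # 19.0
```

The mean score on Form X maps to the mean score on Form Y, and other scores are placed by their distance from the mean in standard deviation units.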

In theory, equating should provide accu¬ 
rate score conversions for any set of persons 
drawn from the examinee population for which 
the test is designed. Furthermore, the same 
score conversion should be appropriate regard¬ 
less of the score interpretation or use intend¬ 
ed. It is not possible to construct conversions 
with these ideal properties between scores on
tests that measure different constructs; that
differ materially in difficulty, reliability, time 
limits, or other conditions of administration; 
or that are designed to different specifications. 

There is another assessment approach 
that may provide interchangeable scores based 
on responses to different items using different 
methods, not referred to as equating. This is 
the use of adaptive tests. It has long been recognized
that little is learned from examinees'
responses to items that are much too easy or 
much too difficult for them. Consequently, 
some testing procedures use only a subset of 
the available items with each examinee in 
order to avoid boredom or frustration, or to 
shorten testing time. An adaptive test con¬ 
sists of a pool of items together with rules 
for selecting a subset of those items to be
administered to an individual examinee, and 
a procedure for placing different examinees’ 
scores on a common scale. The selection 
of successive items is based in part on the 
examinee’s responses to previous items. The 
item pool and item selection rules may be 
designed so that each examinee receives a 
representative set of items, of appropriate 
difficulty. The selection rules generally 
assure that an acceptable degree of precision 
is attained before testing is terminated. At 
one time, such tailored testing was limited 
to certain individually administered psychological
tests. With advances in item
response theory (IRT) and in computer 
technology, however, adaptive testing is 
becoming more sophisticated. With some 
adaptive tests, it may happen that two 
examinees rarely if ever respond to precisely 
the same set of items. Moreover, two examinees
taking the same adaptive test may be
given sets of items that differ markedly in 
difficulty. Nevertheless, when certain statis¬ 
tical and content conditions are met, test 
scores produced by an adaptive testing system
can function like scores from equated
alternate forms. 


Scaling to Achieve Comparability 

The term equating is properly reserved only 
for score conversions derived for alternate forms 
of the same test. It is often useful, however, to 
compare scores from tests that cannot, in the¬ 
ory, be equated. For example, it may be desir¬ 
able to interpret scores from a shortened (and 
hence less reliable) form of a test by first converting
them to corresponding scores on the
full-length version. For the evaluation of examinee
growth over time, it may be desirable to
develop scales that span a broad range of devel¬ 
opmental or educational levels. Test revision 
often brings a need for some linkage between 
scores obtained using newer and older editions. 
International comparative studies or use with 
hearing-impaired examinees may require test 
forms in different languages. In still other 
cases, linkages or alignments may be created 
between tests measuring different constructs, 
perhaps comparing an aptitude with a form 
of behavior, or linking measures of achieve¬ 
ment in several content areas. Scores from 
such tests may sometimes be aligned or presented
in a concordance table to aid users in
estimating relative performance on one test 
from performance on another. 

Score conversions to facilitate such comparisons
may be described using terms like
linkage, calibration, concordance, projection, 
moderation, or anchoring. These weaker score 
linkages may be technically sound and may
fully satisfy desired goals of comparability for 
one purpose or for one subgroup of examinees, 
but they cannot be assumed to be stable over
time or invariant across multiple subgroups of
the examinee population, nor is there any assurance
that scores obtained using different tests
will be equally accurate. Thus, their use for other
purposes or with other populations than originally
intended may require additional research.
For example, a score conversion that was accurate
for a group of native speakers might systematically
overpredict or underpredict the
scores of a group of nonnative speakers.

Cut Scores

A critical step in the development and use of 
some tests is to establish one or more cut points 
dividing the score range to partition the dis¬ 
tribution of scores into categories. These cate¬ 
gories may be used just for descriptive purposes 
or may be used to distinguish among exam¬ 
inees for whom different programs are deemed 
desirable or different predictions are warrant¬ 
ed. An employer may determine a cut score 
to screen potential employees or promote cur¬ 
rent employees; a school may use test scores 
to decide which of several alternative instruc¬ 
tional programs would be most beneficial for 
a student; in granting a professional license, a 
state may specify a minimum passing score 
on a licensure test. 

These examples differ in important 
respects, but all involve delineating categories 
of examinees on the basis of test scores. Such 
cut scores embody the rules according to which 
tests are used or interpreted. Thus, in some 
situations the validity of test interpretations 
may hinge on the cut scores. There can be no 
single method for determining cut scores for 
all tests or for all purposes, nor can there be 
any single set of procedures for establishing 
their defensibility. These examples serve only 
as illustrations. 

The first example, that of an employer 
hiring all those who earn scores above a given 
level on an employment test, is most straight¬ 
forward. Assuming that the employment test 
is valid for its intended use, average job per¬ 
formance would typically be expected to rise 
steadily, albeit slowly, with each increment in 
test score, at least for some range of scores 
surrounding the cut point. In such a case the 
designation of the particular value for the cut 
point may be largely determined by the num¬ 
ber of persons to be hired or promoted. There 
is no sharp difference between those just below 
the cut point and those just above it, and the 
use of the cut score does not entail any criterion-referenced
interpretation. This method
of establishing a cut score may be subject to
legal requirements with respect to the nature 
of the validity and reliability evidence needed 
to support the use of rank-order selections 
and the unavailability of effective alternative 
selection methods, if it has a disproportionate 
effect on one or more subgroups of employees 
or prospective employees. 

In the second example, a school district 
might structure its courses in writing around 
three categories of needs. For children whose 
proficiency is least developed, instruction 
might be provided in small groups, with considerable
individual attention to assist them
in creating meaningful written stories grounded 
in their own experience. For children whose 
proficiency was further developed, more empha¬ 
sis might be placed on systematic exploration 
of the stages of the writing process. Instruction 
for children at the highest proficiency level might 
emphasize mastery of specific writing genres 
or prose structures used in more formal writing.
In an appropriate implementation of such
a program, children could easily be transferred 
from one level to another if their original 
placement was in error or as their proficiency 
increased. Ideally, cut scores delineating cate¬ 
gories in this application would be based on 
research demonstrating empirically that pupils 
in successive score ranges did most often ben¬ 
efit more from the respective treatments to 
which they were assigned than from the alter¬ 
natives available. It would typically be found 
that between those score ranges in which one 
or another instructional treatment was clearly 
superior, there was an intermediate region in 
which neither treatment was clearly preferred. 
The cut score might be located somewhere in 
that intermediate region. 

In the final example, that of a professional 
licensure examination, the cut score represents 
an informed judgment that those scoring below 
it are likely to make serious errors for want of
the knowledge or skills tested. Little evidence 
apart from errors made on the test itself may 
document the need to deny the right to practice
the profession. No test is perfect, of
course, and regardless of the cut score chosen,
some examinees with inadequate skills are
likely to pass and some with adequate skills
are likely to fail. The relative probabilities of
such false positive and false negative errors
will vary depending on the cut score chosen.
A given probability of exposing the public
to potential harm by issuing a license to an 
incompetent individual (false positive) must 
be weighed against some corresponding 
probability of denying a license to, and there¬ 
by disenfranchising, a qualified examinee 
(false negative). Changing the cut score to 
reduce either probability will increase the 
other, although both kinds of errors can be 
minimized through sound test design that 
anticipates the role of the cut score in test use 
and interpretation. Determining cut scores 
in such situations cannot be a purely tech¬ 
nical matter, although empirical studies 
and statistical models can be of great value 
in informing the process. 
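
The trade-off between the two kinds of error can be illustrated with a small Python sketch. The examinee data are invented, and the competence flag stands for a hypothetical external judgment of true status that, in practice, is rarely available with certainty.

```python
def classification_errors(examinees, cut_score):
    """Count false positives and false negatives for a pass/fail cut score.

    examinees: list of (test_score, is_competent) pairs, where is_competent
    is a hypothetical external judgment of the examinee's true status.
    """
    false_pos = sum(1 for score, ok in examinees if score >= cut_score and not ok)
    false_neg = sum(1 for score, ok in examinees if score < cut_score and ok)
    return false_pos, false_neg

# Invented data: ten examinees with test scores and assumed true status.
examinees = [(52, False), (58, False), (61, False), (63, True), (66, False),
             (68, True), (70, True), (74, True), (79, True), (85, True)]

for cut in (60, 65, 70):
    print(cut, classification_errors(examinees, cut))
# 60 (2, 0)
# 65 (1, 1)
# 70 (0, 2)
```

Raising the cut score from 60 to 70 removes the false positives in this example at the cost of two false negatives, which is the pattern the paragraph above describes.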

Cut scores embody value judgments as 
well as technical and empirical considerations. 
Where the results of the standard-setting process 
have highly significant consequences, and 
especially where large numbers of examinees 
are involved, those responsible for establish¬ 
ing cut scores should be concerned that the 
process by which cut scores are determined be 
clearly documented and defensible. The qual¬ 
ifications of any judges involved in standard 
setting and the process by which they are
selected are part of that documentation. Care
must be taken to assure that judges understand
what they are to do. The process must
be such that well-qualified judges can apply 
their knowledge and experience to reach 
meaningful and relevant judgments that accu¬ 
rately reflect their understandings and inten¬ 
tions. A sufficiently large and representative 
group of judges should be involved to provide 
reasonable assurance that results would not 
vary greatly if the process were replicated. 


Standard 4.1 

Test documents should provide test users 
with clear explanations of the meaning and 
intended interpretation of derived score scales, 
as well as their limitations. 

Comment: All scales (raw score or derived) may
be subject to misinterpretation. Sometimes 
scales are extrapolated beyond the range of 
available data or are interpolated without suffi¬ 
cient data points. Grade- and age-equivalent 
scores have been criticized in this regard, but 
percentile ranks and standard score scales are 
also subject to misinterpretation. If the nature 
or intended uses of a scale are novel, it is espe¬ 
cially important that its uses, interpretations, 
and limitations be clearly described. Illustrations 
of appropriate versus inappropriate interpreta¬ 
tions may be helpful, especially for types of 
scales or interpretations that may be unfamiliar 
to most users. This standard pertains to score
scales intended for criterion-referenced as well 
as for norm-referenced interpretation. 

Standard 4.2 

The construction of scales used for report¬ 
ing scores should be described clearly in 
test documents. 

Comment: When scales, norms, or other 
interpretive systems are provided by the test 
developer, technical documentation should 
enable users to judge the quality and preci¬ 
sion of the resulting derived scores. This 
standard pertains to score scales intended for
criterion-referenced as well as for norm-referenced
interpretation.

Standard 4.3 

If there is sound reason to believe that spe¬ 
cific misinterpretations of a score scale are 
likely, test users should be explicitly forewarned.

Comment: Test publishers and users can reduce
misinterpretations of grade-equivalent scores, 
for example, by ensuring that such scores are 
accompanied by instructions that make clear 
that grade-equivalent scores do not represent a 
standard of growth per year or grade and that 
roughly 50% of the students tested in the stan¬ 
dardization sample should by definition fall 
below grade level. As another example, a score 
scale point originally defined as the mean of 
some reference population should no longer be 
interpreted as representing average perform¬ 
ance if the scale is held constant over time and 
the examinee population changes. 

Standard 4.4 

When raw scores are intended to be directly 
interpretable, their meanings, intended 
interpretations, and limitations should be 
described and justified in the same manner 
as is done for derived score scales. 

Comment: In some cases the items in a test 
are a representative sample of a well-defined 
domain of items. The proportion correct on 
the test may then be interpreted as an estimate 
of the proportion of items in the domain that 
could be answered correctly. In other cases, 
different interpretations may be attached to 
scores above or below one or another cut score. 
Support should be offered for any such inter¬ 
pretations recommended by the test developer. 

Standard 4.5 

Norms, if used, should refer to clearly 
described populations. These populations 
should include individuals or groups to 
whom test users will ordinarily wish to 
compare their own examinees. 

Comment: It is the responsibility of test developers
to describe norms clearly and the responsibility
of test users to employ norms appropriately.
Users need to know the applicability of a test to 
different groups. Differentiated norms or summary
information about differences between
gender, ethnic, language, disability, grade, or 
age groups, for example, may be useful in some 
cases. The permissible uses of such differenti¬ 
ated norms and related information may be 
limited by law. Users also need to be made alert
to situations in which norms are less appropri¬ 
ate for some groups or individuals than others. 
On an occupational interest inventory, for 
example, norms for persons actually engaged 
in an occupation may be inappropriate for 
interpreting the scores of persons not so 
engaged. As another example, the appropri¬ 
ateness of norms for personality inventories 
or relationship scales may differ depending 
upon an examinee's sexual orientation.

Standard 4.6 

Reports of norming studies should include 
precise specification of the population that 
was sampled, sampling procedures and par¬ 
ticipation rates, any weighting of the sample, 
the dates of testing, and descriptive statistics. 
The information provided should be sufficient 
to enable users to judge the appropriateness of 
the norms for interpreting the scores of local 
examinees. Technical documentation should 
indicate the precision of the norms themselves. 

Comment: Scientific sampling is important if 
norms are to be representative of intended 
populations. For example, schools already 
using a given published test and volunteering 
to participate in a norming study for that test 
should not be assumed to be representative of 
schools in general. In addition to sampling pro¬ 
cedures, participation rates should be reported, 
and the method of calculating participation
rates should be clearly described. Studies that are
designed to be nationally representative often 
use weights so that the weighted sample better 
represents the nation than does the unweighted 
sample. When weights are used, it is important 
that the procedure for deriving the weights be 
described and that the demographic representation
of both the weighted and the unweighted
samples be given. If norming data are collect¬ 
ed under conditions in which student motiva¬ 
tion in completing the test is likely to differ 
from that expected during operational use, this 
should be clearly documented. Likewise, if the
instructional histories of students in the norm¬ 
ing sample differ systematically from those to 
be expected during operational test use, that 
fact should be noted. Norms based on samples
cannot be perfectly precise. Even though the 
imprecision of norm-referenced interpretations
due to imperfections in the norms themselves 
may be small compared to that due to meas¬ 
urement error, estimates of the precision of 
norms should be available in technical docu¬ 
mentation. For example, standard errors based 
on the sample design might be presented. In 
some testing applications, norms based on all 
examinees tested over a given period of time
may be useful for some purposes. Such norms 
should be clearly characterized as being based 
on a sample of persons routinely tested as part 
of an ongoing testing program. 
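
As an illustration of how the precision of a norm depends on sample size, the following sketch approximates the standard error of a percentile rank. It assumes simple random sampling, which is a simplification: norming studies commonly use stratified or clustered designs, and standard errors should then reflect the actual sample design.

```python
import math

def percentile_rank_standard_error(percentile_rank, sample_size):
    """Approximate standard error, in percentile-rank points, of a norm.

    Uses the binomial formula sqrt(p * (1 - p) / n), which assumes a
    simple random sample from the reference population.
    """
    p = percentile_rank / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / sample_size)

# Standard error of the 50th percentile rank for three sample sizes.
for n in (100, 400, 1600):
    print(n, round(percentile_rank_standard_error(50, n), 2))
# 100 5.0
# 400 2.5
# 1600 1.25
```

Under this assumption, quadrupling the norming sample halves the standard error.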

Standard 4.7 

If local examinee groups differ materially 
from the populations to which norms refer, a 
user who reports derived scores based on the 
published norms has the responsibility to 
describe such differences if they bear upon 
the interpretation of the reported scores. 

Comment: In employment settings, the qualifi¬ 
cations of local examinee groups may fluctuate 
depending on recruitment or referral procedures
as well as market conditions. In such
cases, appropriate test use and interpretation
may not require documentation or cautions
concerning departures from characteristics of 
the norming population. 

Standard 4.8 

When norms are used to characterize examinee
groups, the statistics used to summarize
each group's performance and the norms to
which those statistics are referred should be 
clearly defined and should support the 
intended use or interpretation. 

Comment: Group means are distributed dif¬ 
ferently from individual scores. For example, 
it is not possible to determine the percentile 
rank of a school’s average test score if all that is 
known are the percentile ranks of each of that
school’s students. It may sometimes be useful to 
develop special norms for group means, but 
when the sizes of the groups differ materially 
or when some groups are much more heteroge¬ 
neous than others, the construction and inter¬ 
pretation of group norms is problematical. One 
common and acceptable procedure is to report 
the percentile rank of the median group 
member, for example, the median percentile 
rank of the pupils tested in a given school.
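
The difference between norms for individuals and norms for group means can be sketched as follows. All numbers are hypothetical, and both distributions are assumed to be normal purely for illustration: individual scores with a mean of 50 and a standard deviation of 10, and school means with the same mean but a standard deviation of 4.

```python
from statistics import NormalDist, median

# Assumed (hypothetical) distributions of individual scores and of school means.
individual_scores = NormalDist(mu=50, sigma=10)
school_means = NormalDist(mu=50, sigma=4)

# The same average of 55 ranks very differently against the two distributions.
school_average = 55
rank_among_individuals = round(100 * individual_scores.cdf(school_average))
rank_among_schools = round(100 * school_means.cdf(school_average))
print(rank_among_individuals, rank_among_schools)  # 69 89

# The reporting procedure named in the comment: the median percentile rank
# of the pupils tested in one (invented) school.
pupil_percentile_ranks = [22, 35, 41, 48, 56, 63, 90]
print(median(pupil_percentile_ranks))  # 48
```

Because school means vary less than individual scores, a school average that would be unremarkable for an individual can be well above most other schools.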

Standard 4.9 

When raw score or derived score scales are 
designed for criterion-referenced interpreta¬ 
tion, including the classification of exam¬ 
inees into separate categories, the rationale 
for recommended score interpretations 
should be clearly explained. 

Comment: Criterion-referenced interpretations 
are score-based descriptions or inferences that 
do not take the form of comparisons to the test 
performance of other examinees. Examples 
include statements that some psychopathology 
is likely present, that a prospective employee 
possesses specific skills required in a given posi¬ 
tion, or that a child scoring above a certain score 
point can successfully apply a given set of skills. 
Such interpretations may refer to the absolute 
levels of test scores or to patterns of scores for 
an individual examinee. Whenever the test
developer recommends such interpretations, 
the rationale and empirical basis should be 
clearly presented. Serious efforts should be 
made whenever possible to obtain independent 
evidence concerning the soundness of such
score interpretations. Criterion-referenced 
and norm-referenced scales are not mutually 
exclusive. Given adequate supporting data, 
scores may be interpreted by both approaches, 
not necessarily just one or the other. 

Standard 4.10 

A clear rationale and supporting evidence 
should be provided for any claim that scores 
earned on different forms of a test may be 
used interchangeably. In some cases, direct 
evidence of score equivalence may be provid¬ 
ed. In other cases, evidence may come from 
a demonstration that the theoretical assump¬ 
tions underlying procedures for establishing 
score comparability have been sufficiently sat¬ 
isfied. The specific rationale and the evidence 
required will depend in part on the intended 
uses for which score equivalence is claimed. 

Comment: Support should be provided for any
assertion that scores obtained using different 
items or testing materials, or different testing 
procedures, are interchangeable for some pur¬ 
pose. This standard applies, for example, to 
alternate forms of a paper-and-pencil test or
to alternate sets of items taken by different
examinees in computerized adaptive testing.
It also applies to test forms administered in
different formats (e.g., paper-and-pencil and
computerized tests) or test forms designed for
individual versus group administration. Score 
equivalence is easiest to establish when differ¬ 
ent forms are constructed following identical 
procedures and then equated statistically. When
that is not possible, for example, in cases where
different test formats are used, additional evi¬ 
dence may be required to establish the requisite 
degree of score equivalence for the intended 
context and purpose. When recommended 
inferences or actions are based solely on classifi¬ 
cations of examinees into one of two or more 
categories, the rationale and evidence should 
address consistency of classification. If the only
score reported and used is a pass-fail decision,
for example, then the form-to-form equivalence
of measurements for examinees far above
or far below the cut score is of no concern. 
Some testing accommodations may only affect 
the dependence of test scores on capabilities 
irrelevant to the construct the test is intended 
to measure. Use of a large-print edition, for 
example, assures that performance does not
depend on the ability to perceive standard-size
print. In such cases, relatively modest studies 
or professional judgment may be sufficient to 
support claims of score equivalence.
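
Consistency of classification across forms can be summarized, in its simplest form, as the proportion of examinees who receive the same pass-fail decision on both forms. The sketch below uses invented scores for eight examinees who took two forms; the function name and the cut score of 65 are assumptions made for the example.

```python
def classification_consistency(scores_form_a, scores_form_b, cut_score):
    """Proportion of examinees given the same pass/fail decision on both forms."""
    pairs = list(zip(scores_form_a, scores_form_b))
    same = sum(1 for a, b in pairs if (a >= cut_score) == (b >= cut_score))
    return same / len(pairs)

# Invented scores for the same eight examinees on two forms.
form_a = [40, 55, 61, 64, 66, 72, 80, 90]
form_b = [43, 52, 66, 63, 64, 75, 78, 88]
print(classification_consistency(form_a, form_b, cut_score=65))  # 0.75
```

The two inconsistent decisions in this example both involve examinees scoring within a few points of the cut score, while examinees far above or far below it are classified the same way on either form.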

Standard 4.11 

When claims of form-to-form score equiva¬ 
lence are based on equating procedures, 
detailed technical information should be 
provided on the method by which equating 
functions or other linkages were established
and on the accuracy of equating functions.

Comment: The fundamental concern is to 
show that equated scores measure essentially
the same construct, with very similar levels of 
reliability and conditional standard errors of 
measurement. Technical information should 
include the design of equating studies, the
statistical methods used, the size and relevant 
characteristics of examinee samples used in 
equating studies, and the characteristics of any 
anchor tests or linking items. Standard errors 
of equating functions should be estimated and 
reported whenever possible. Sample sizes permitting,
it may be informative to determine
equating functions independently for identifiable
subgroups of examinees. It may also be
informative to use two anchor forms and to 
conduct the equating using each of the anchors. 
In some cases, equating functions may be determined
independently using different statistical
methods. The correspondence of separate functions
obtained by such methods can lend support
to the adequacy of the equating results. Any
substantial disparities found by such methods
should be resolved or reported. To be most
useful, equating error should be presented in 
units of the reported score scale. For testing
programs with cut scores, equating error near 
the cut score is of primary importance. The 
degree of scrutiny of equating functions should
be commensurate with the extent of test use
anticipated and the importance of the deci¬ 
sions the test scores are intended to inform. 
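
As an illustration only, the sketch below shows one way a standard error of equating might be estimated and expressed in units of the reported score scale, evaluated at a cut score. It assumes a random-groups design, linear equating, and a bootstrap; none of these is prescribed by the standard, and the function names, the sample scores, and the cut score of 16 are hypothetical.

```python
import random
import statistics


def linear_equate(x_scores, y_scores):
    """Return a function that places Form X raw scores on the Form Y scale.

    Linear equating matches the means and standard deviations of the two
    score distributions (random-groups design assumed).
    """
    mean_x, sd_x = statistics.mean(x_scores), statistics.stdev(x_scores)
    mean_y, sd_y = statistics.mean(y_scores), statistics.stdev(y_scores)
    if sd_x == 0:
        raise ValueError("Form X scores have no variance")
    slope = sd_y / sd_x
    return lambda x: mean_y + slope * (x - mean_x)


def bootstrap_se(x_scores, y_scores, x_point, reps=500, seed=0):
    """Bootstrap standard error of the equated score at x_point, in Form Y units."""
    rng = random.Random(seed)
    estimates = []
    while len(estimates) < reps:
        xs = rng.choices(x_scores, k=len(x_scores))
        ys = rng.choices(y_scores, k=len(y_scores))
        try:
            estimates.append(linear_equate(xs, ys)(x_point))
        except ValueError:
            continue  # resample had no variance on Form X; draw again
    return statistics.stdev(estimates)


# Hypothetical raw scores from two randomly equivalent groups.
form_x = [10, 12, 14, 16, 18, 11, 15, 13]
form_y = [20, 24, 28, 32, 36, 22, 30, 26]
cut_on_x = 16  # hypothetical cut score on Form X
equated_cut = linear_equate(form_x, form_y)(cut_on_x)
se_at_cut = bootstrap_se(form_x, form_y, cut_on_x)
print(f"equated cut = {equated_cut:.2f}, bootstrap SE = {se_at_cut:.2f}")
```

Repeating the calculation for identifiable subgroups, or with a second equating method, would give the kind of independent check the comment describes. Operational samples would be far larger than the eight scores used here.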

Standard 4.12 

In equating studies that rely on the statistical
equivalence of examinee groups receiving
different forms, methods of assuring such 
equivalence should be described in detail. 

Comment: Certain equating designs rely on the 
random equivalence of groups receiving different 
forms. Often, one way to assure such equivalence 
is to systematically mix different test forms and 
then distribute them in a random fashion so 
that roughly equal numbers of examinees in 
each group tested receive each form. 
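
The mixing procedure described in this comment can be sketched as follows. This is a minimal illustration, assuming that a roster of examinee identifiers is available before testing; the function name and the roster of ten examinees are hypothetical.

```python
import random
from collections import Counter
from itertools import cycle


def assign_spiraled_forms(examinee_ids, form_labels, seed=None):
    """Hand out forms in a repeating A, B, C, ... order to a shuffled roster.

    Shuffling the roster makes the assignment random; cycling through the
    forms keeps the number of examinees per form within one of each other.
    """
    rng = random.Random(seed)
    roster = list(examinee_ids)
    rng.shuffle(roster)
    return dict(zip(roster, cycle(form_labels)))


# Hypothetical group of ten examinees and three forms.
assignment = assign_spiraled_forms(range(1, 11), ["A", "B", "C"], seed=42)
print(Counter(assignment.values()))
```

Recording the seed alongside the assignment is one way to document, as the standard requires, how group equivalence was assured.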

Standard 4.13 

In equating studies that employ an anchor 
test design, the characteristics of the anchor 
test and its similarity to the forms being 
equated should be presented, including both 
content specifications and empirically deter¬ 
mined relationships among test scores. If 
anchor items are used, as in some IRT-based 
and classical equating studies, the represen¬ 
tativeness and psychometric characteristics 
of anchor items should be presented. 

Comment: Tests or test forms may be linked 
via common items embedded within each of 
them, or a common test administered togeth¬ 
er with each of them. These common items 
or tests are referred to as linking items, anchor 
items, or anchor tests. With such methods,
the quality of the resulting equating depends 
strongly on the adequacy of the anchor tests 
or items used. 


Standard 4.14 

When score conversions or comparison procedures
are used to relate scores on tests or
test forms that are not closely parallel, the 
construction, intended interpretation, and 
limitations of those conversions or compar¬ 
isons should be clearly described. 

Comment: Various score conversions or con¬ 
cordance tables have been constructed relating 
tests at different levels of difficulty, relating 
earlier to revised forms of published tests, creating
score concordances between different
tests of similar or different constructs, or for 
other purposes. Such conversions are often 
useful, but they may also be subject to misin¬ 
terpretation. The limitations of such conver¬ 
sions should be clearly described. 

Standard 4.15 

When additional test forms are created by taking
a subset of the items in an existing test form
or by rearranging its items and there is sound 
reason to believe that scores on these forms 
may be influenced by item context effects, 
evidence should be provided that there is no 
undue distortion of norms for the different 
versions or of score linkages between them. 

Comment: Some tests and test batteries are 
published in both a full-length version and a 
survey or short version. In other cases, multi¬ 
ple versions of a single test form may be cre¬ 
ated by rearranging its items. It should not be 
assumed that performance data derived from 
the administration of items as part of the initial
version can be used to approximate norms
or construct conversion tables for alternative
intact tests. Due caution is required in cases
where context effects are likely, including
speeded tests, long tests where fatigue may be
a factor, and so on. In many cases, adequate
psychometric data may only be obtainable 
from independent administrations of the 
alternate forms.


Standard 4.16 

If test specifications are changed from one 
version of a test to a subsequent version, such 
changes should be identified in the test man¬ 
ual, and an indication should be given that 
converted scores for the two versions may not 
be strictly equivalent. When substantial 
changes in test specifications occur, either 
scores should be reported on a new scale or 
a clear statement should be provided to alert 
users that the scores are not directly compara¬ 
ble with those on earlier versions of the test. 

Comment: Major shifts sometimes occur in the 
specifications of tests that are used for substantial
periods of time. Often such changes take
advantage of improvements in item types or 
of shifts in content that have been shown to 
improve validity and, therefore, are highly 
desirable. It is important to recognize, however,
that such shifts will result in scores that
cannot be made strictly interchangeable with
scores on an earlier form of the test. 

Standard 4.17 

Testing programs that attempt to maintain 
a common scale over time should conduct 
periodic checks of the stability of the scale 
on which scores are reported. 

Comment: In some testing programs, items are 
introduced into and retired from item pools on 
an ongoing basis. In other cases, the items in suc¬ 
cessive test forms may overlap very little, or not 
at all. In either case, if a fixed scale is used for re¬ 
porting, it is important to assure that the mean¬ 
ing of the scaled scores does not change over time. 

Standard 4.18 

If a publisher provides norms for use in test
score interpretation, then so long as the test
remains in print, it is the publisher’s responsibility
to assure that the test is renormed with
sufficient frequency to permit continued accu¬ 
rate and appropriate score interpretations. 


Comment: Test publishers should assure that 
up-to-date norms are readily available, but it 
remains the test user's responsibility to avoid 
inappropriate use of norms that are out of date 
and to strive to assure accurate and appropri¬ 
ate test interpretations. 

Standard 4.19 

When proposed score interpretations involve 
one or more cut scores, the rationale and 
procedures used for establishing cut scores 
should be clearly documented.

Comment: Cut scores may be established to 
select a specified number of examinees (e.g., 
to fill existing vacancies), in which case little 
further documentation may be needed con¬ 
cerning the specific question of how the cut 
scores are established, though attention should 
be paid to legal requirements that may apply. 
In other cases, however, cut scores may be used 
to classify examinees into distinct categories 
(e.g., diagnostic categories, or passing versus 
failing) for which there are no preestablished 
quotas. In these cases, the standard-setting 
method must be clearly documented. Ideally, 
the role of cut scores in test use and interpre¬ 
tation is taken into account during test design. 
Adequate precision in regions of score scales 
where cut points are established is prerequisite 
to reliable classification of examinees into cat¬ 
egories. If standard setting employs data on the 
score distributions for criterion groups or on 
the relation of test scores to one or more criteri¬ 
on variables, those data should be summarized 
in technical documentation. If a judgmental 
standard-setting process is followed, the method 
employed should be clearly described, and the 
precise nature of the judgments called for should 
be presented, whether those are judgments of 
persons, of item or test performances, or of 
other criterion performances predicted by test 
scores. Documentation should also include the 
selection and qualification of judges, training 
provided, any feedback to judges concerning 
the implications of their provisional judgments,
and any opportunities for judges to confer with
one another. Where applicable, variability over 
judges should be reported. Whenever feasible, an 
estimate should be provided of the amount of 
variation in cut scores that might be expected if 
the standard-setting procedure were replicated. 
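
By way of example, variability over judges and a rough replication estimate might be summarized as sketched below. The sketch assumes an item-rating procedure in which each judge estimates, for every item, the probability that a borderline examinee answers correctly; that procedure is an assumption of this example and is not named in the standard. The standard error treats the judges as a simple random sample of potential judges, which is also an assumption. The function name and ratings are hypothetical.

```python
import math
import statistics


def panel_cut_score(ratings):
    """Summarize a judgmental standard setting.

    ratings holds one list per judge; each list gives that judge's estimated
    probability, item by item, that a borderline examinee answers correctly.
    Returns (cut score, SD over judges, standard error of the cut score).
    """
    judge_cuts = [sum(item_probs) for item_probs in ratings]
    cut = statistics.mean(judge_cuts)
    sd_over_judges = statistics.stdev(judge_cuts)
    se_of_cut = sd_over_judges / math.sqrt(len(judge_cuts))
    return cut, sd_over_judges, se_of_cut


# Hypothetical ratings from three judges on a four-item test.
ratings = [
    [0.50, 0.50, 0.50, 0.50],
    [0.75, 0.75, 0.75, 0.75],
    [1.00, 1.00, 1.00, 1.00],
]
cut, sd_over_judges, se_of_cut = panel_cut_score(ratings)
print(f"cut = {cut:.2f}, SD over judges = {sd_over_judges:.2f}, SE = {se_of_cut:.2f}")
```

A standard error computed this way reflects only judge-to-judge variation within one panel; a full replication with a new panel, new materials, or a different method could vary more.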

Standard 4.20 

When feasible, cut scores defining categories 
with distinct substantive interpretations 
should be established on the basis of sound 
empirical data concerning the relation of test 
performance to relevant criteria. 

Comment: In employment settings, although
it is important to establish that test scores are 
related to job performance, the precise rela¬ 
tion of test and criterion may have little bear¬ 
ing on the choice of a cut score. However, in 
contexts where distinct interpretations are 
applied to different score categories, the 
empirical relation of test to criterion assumes 
greater importance. Cut scores used in inter¬ 
preting diagnostic tests may be established on 
the basis of empirically determined score dis¬ 
tributions for criterion groups. With achieve¬ 
ment or proficiency tests, such as those used 
in licensure, suitable criterion groups (e.g., 
successful versus unsuccessful practitioners) 
are often unavailable. Nonetheless, it is highly 
desirable, when appropriate and feasible, to 
investigate the relation between test scores 
and performance in relevant practical settings. 
Note that a carefully designed and imple¬ 
mented procedure based solely on judgments 
of content relevance and item difficulty may 
be preferable to an empirical study with an 
inadequate criterion measure or other defi¬ 
ciencies. Professional judgment is required 
to determine an appropriate standard-setting 
approach (or combination of approaches) in 
any given situation. In general, one would 
not expect to find a sharp difference in levels
of the criterion variable between those just
below versus just above the cut score, but evidence
should be provided where feasible of a
relationship between test and criterion per¬ 
formance over a score interval that includes 
or approaches the cut score. 

Standard 4.21 

When cut scores defining pass-fail or profi¬ 
ciency categories are based on direct judg¬ 
ments about the adequacy of item or test 
performances or performance levels, the 
judgmental process should be designed so 
that judges can bring their knowledge and 
experience to bear in a reasonable way. 

Comment: Cut scores are sometimes based on 
judgments about the adequacy of item or test 
performances (e.g., essay responses to a writ¬ 
ing prompt) or performance levels (e.g., the 
level that would characterize a borderline 
examinee). The procedures used to elicit such 
judgments should result in reasonable, defensi¬ 
ble standards that accurately reflect the judges’ 
values and intentions. Reaching such judgments 
may be most straightforward when judges are 
asked to consider kinds of performances with 
which they are familiar and for which they 
have formed clear conceptions of adequacy or 
quality. When the responses elicited by a test
neither sample nor closely simulate the use of
tested knowledge or skills in the actual criterion
domain, judges are not likely to approach
the task with such clear understandings. Special 
care must then be taken to assure that judges 
have a sound basis for making the judgments 
requested. Thorough familiarity with descrip¬ 
tions of different proficiency categories, prac¬ 
tice in judging task difficulty with feedback 
on accuracy, the experience of actually taking 
a form of the test, feedback on the failure 
rates entailed by provisional standards, and 
other forms of information may be beneficial 
in helping judges to reach sound and principled
decisions.

5. TEST ADMINISTRATION, SCORING,
AND REPORTING 


Background 

The usefulness and interpretability of test
scores require that a test be administered and
scored according to the developer's instructions.
When directions to examinees, testing
conditions, and scoring procedures follow the 
same detailed procedures, the test is said to be 
standardized. Without such standardization, 
the accuracy and comparability of score inter¬ 
pretations would be reduced. For tests designed 
to assess the examinee's knowledge, skills, or 
abilities, standardization helps to ensure that 
all examinees have the same opportunity to 
demonstrate their competencies. Maintaining 
test security also helps to ensure that no one 
has an unfair advantage. 

Occasionally, however, situations arise in 
which modifications of standardized procedures 
may be advisable or legally mandated. Persons 
of different backgrounds, ages, or familiarity 
with testing may need nonstandard modes of 
test administration or a more comprehensive 
orientation to the testing process, in order that 
all test takers can come to the same under¬ 
standing of the task. Standardized modes of 
presenting information or of responding may 
not be suitable for specific individuals, such 
as persons with some kinds of disability, or 
persons with limited proficiency in the language
of the test, so that accommodations may be 
needed (see chapters 9 and 10). Large-scale 
testing programs generally have established
specific procedures to be used in considering 
and granting accommodations. Some test users 
feel that any accommodation not specifically 
required by law could lead to a charge of 
unfair treatment and discrimination. Although 
accommodations are made with the intent of 
maintaining score comparability, the extent 
to which that is possible may not be known. 
Comparability of scores may be compromised, 
and the test may then not measure the same 
constructs for all test takers.


Tests and assessments differ in their degree 
of standardization. In many instances different 
examinees are given not the same test form, but 
equivalent forms that have been shown to yield 
comparable scores. Some assessments permit 
examinees to choose which tasks to perform or 
which pieces of their work are to be evaluated.
A degree of standardization can be maintained 
by specifying the conditions of the choice and 
the criteria of evaluation of the products. When 
an assessment permits a certain kind of collabo¬ 
ration, the limits of that collaboration can be 
specified. With some assessments, test adminis¬ 
trators may be expected to tailor their instruc¬ 
tions to help assure that all examinees understand 
what is expected of them. In all such cases, the 
goal remains the same: to provide accurate and
comparable measurement for everyone, and 
unfair advantage to no one. The degree of 
standardization is dictated by that goal, and 
by the intended use of the test. 

Standardized directions to test takers 
help to ensure that all test takers understand
the mechanics of test taking. Directions generally
inform test takers how to make their
responses, what kind of help they may legiti¬ 
mately be given if they do not understand 
the question or task, how they can correct 
inadvertent responses, and the nature of any 
time constraints. General advice is some¬ 
times given about omitting item responses. 
Many tests, including computer-administered 
tests, require special equipment. Practice exercises
are often presented in such cases to ensure
that the test taker understands how to operate
the equipment. The principle of standardization
includes orienting test takers to materials
with which they may not be familiar. Some
equipment may be provided at the testing site,
such as shop tools or balances. Opportunity
for test takers to practice with the equipment
will often be appropriate, unless using the
equipment is the purpose of the test.

Tests are sometimes administered by
computer, with test responses made by key¬ 
board, computer mouse, or similar device. 
Although many test takers are accustomed
to computers, some are not and may need 
some brief explanation. Even those test tak¬ 
ers who use computers will need to know 
about some details. Special issues arise in 
managing the testing environment, such as 
the arrangement of illumination so that 
light sources do not reflect on the computer 
screen, possibly interfering with display leg¬ 
ibility. Maintaining a quiet environment 
can be challenging when candidates are tested
separately, starting at different times and
finishing at different times from neighboring
test takers. Those who administer computer-based
tests require training in the
hardware and software used for the test, so 
that they can deal with problems that may 
arise in human-computer interactions. 

Standardized scoring procedures help 
to ensure accurate scoring and reporting, 
which are essential in all circumstances. When 
scoring is done by machine, the accuracy of 
the machine is at issue, including any scoring 
algorithm. When scoring is done by human
judges, scorers require careful training. Regular
monitoring can also help to ensure that every
test protocol is scored according to the same
standardized criteria and that the criteria do
not change as the test scorers progress through
the submitted test responses. 

Test scores, per se, are not readily inter¬ 
preted without other information, such as 
norms or standards, indications of measure¬ 
ment error, and descriptions of test content. 
Just as a temperature of 50° in January is
warm for Minnesota and cool for Florida, a 
test score of 50 is not meaningful without 
some context. When the scores are to be
reported to persons who are not technical 
specialists, interpretive material can be pro¬ 
vided that is readily understandable to those 
receiving the report. Often, the test user
provides an interpretation of the results for
the test taker, suggesting the limitations of 
the results and the relationship of any reported 
scores to other information. Scores on some 
tests are not designed to be released to test
takers; only broad test interpretations, or 
dichotomous classifications, such as pass/fail, 
are intended to be reported. 

Interpretations of test results are some¬ 
times prepared by computer systems. Such 
interpretations are generally based on a com¬ 
bination of empirical data and expert judg¬ 
ment and experience. In some professional 
applications of individualized testing, the 
computer-prepared interpretations are com¬ 
municated by a professional, possibly with 
modifications for special circumstances. 
Such test interpretations require validation. 
Consistency with interpretations provided by 
nonalgorithmic approaches is clearly a concern.

In some large-scale assessments, the pri¬ 
mary target of assessment is not the individ¬ 
ual test taker but is a larger unit, such as a 
school district or an industrial plant. Often, 
different test takers are given different sets 
of items, following a carefully balanced matrix 
sampling plan, to broaden the range of infor¬ 
mation that can be obtained in a reasonable 
time period. The results acquire meaning 
when aggregated over many individuals taking 
different samples of items. Such assessments 
may not furnish enough information to sup¬ 
port even minimally valid, reliable scores for 
individuals, as each individual may take only 
an incomplete test. 

Some further issues of administration 
and scoring are discussed in chapter 3, "Test
Development and Revision." 






Standard 5.1 

Test administrators should follow carefully 
the standardized procedures for administra¬ 
tion and scoring specified by the test devel¬ 
oper, unless the situation or a test taker's 
disability dictates that an exception should 
be made. 

Comment: Specifications regarding instructions
to test takers, time limits, the form of
item presentation or response, and test mate¬ 
rials or equipment should be strictly observed. 
In general, the same procedures should be 
followed as were used when obtaining the 
data for scaling and norming the test scores.
A test taker with a disabling condition may 
require special accommodation. Other special 
circumstances may require some flexibility in 
administration. Judgments of the suitability 
of adjustments should be tempered by the 
consideration that departures from standard 
procedures may jeopardize the validity of the 
test score interpretations.

Standard 5.2 

Modifications or disruptions of standardized 
test administration procedures or scoring 
should be documented. 

Comment: Information about the nature of 
modifications of administration should be 
maintained in secure data files, so that research
studies or case reviews based on test records 
can take this into account. This includes not 
only special accommodations for particular 
test takers, but also disruptions in the testing 
environment that may affect all test takers in
the testing session. A researcher may wish to 
use only the records based on standardized 
administration. In other cases, research studies
may depend on such information to form
groups of respondents. Test users or test spon¬ 
sors should establish policies concerning who 
keeps the files and who may have access to 
the files. Whether the information about
modifications is reported to users of test data,
such as admissions officers, depends on dif¬ 
ferent considerations (see chapters 8 and 10). 
If such reports are made, certain cautions may 
be appropriate. 

Standard 5.3 

When formal procedures have been estab¬ 
lished for requesting and receiving accom¬ 
modations, test takers should be informed 
of these procedures in advance of testing. 

Comment: When large-scale testing programs 
have established strict procedures to be fol¬ 
lowed, administrators should not depart from 
these procedures. 

Standard 5.4 

The testing environment should furnish rea¬ 
sonable comfort with minimal distractions. 

Comment: Noise, disruption in the testing 
area, extremes of temperature, poor lighting, 
inadequate work space, illegible materials, 
and so forth are among the conditions that 
should be avoided in testing situations. The 
testing site should be readily accessible. 
Testing sessions should be monitored where 
appropriate to assist the test taker when a 
need arises and to maintain proper adminis¬ 
trative procedures. In general, the testing 
conditions should be equivalent to those that 
prevailed when norms and other interpreta¬ 
tive data were obtained. 

Standard 5.5 

Instructions to test takers should clearly 
indicate how to make responses. Instructions 
should also be given in the use of any equip¬ 
ment likely to be unfamiliar to test takers. 
Opportunity to practice responding should 
be given when equipment is involved, unless 
use of the equipment is being assessed.

Comment: When electronic calculators are provided
for use, examinees may need practice in
using the calculator. Examinees may need 
practice responding with unfamiliar tasks, such 
as a numeric grid, which is sometimes used with 
mathematics performance items. In computer- 
administered tests, the method of responding 
may be unfamiliar to some test takers. Where 
possible, the practice responses should be mon¬ 
itored to ensure that the test taker is making 
acceptable responses. In some performance tests 
that involve tools or equipment, instructions may 
be needed for unfamiliar tools, unless accommodating
to unfamiliar tools is part of what is being
assessed. If a test taker is unable to use the equip¬ 
ment or make the responses, it may be appropri¬ 
ate to consider alternative testing modes. 

Standard 5.6 

Reasonable efforts should be made to assure 
the integrity of test scores by eliminating 
opportunities for test takers to attain scores 
by fraudulent means. 

Comment: In large-scale testing programs where 
the results may be viewed as having important
consequences, efforts to assure score integrity 
should include, when appropriate and practi¬ 
cable, stipulating requirements for identifica¬ 
tion, constructing seating charts, assigning 
test takers to seats, requiring appropriate space 
between seats, and providing continuous 
monitoring of the testing process. Test devel¬ 
opers should design test materials and proce¬ 
dures to minimize the possibility of cheating. 
Test administrators should note and report 
any significant instances of testing irregularity. 

A local change in the date or time of testing 
may offer an opportunity for fraud. In gener¬ 
al, steps should be taken to minimize the pos¬ 
sibility of breaches in test security. In any 
evaluation of work products (e.g., portfolios),
steps should be taken to ensure that the prod¬ 
uct represents the candidate’s own work, and 
that the amount and kind of assistance provided
should be consistent with the intent of
the assessment. Ancillary documentation,
such as the date when the work was done, 
may be useful. 

Standard 5.7 

Test users have the responsibility of protect¬ 
ing the security of test materials at all times. 

Comment: Those who have test materials 
under their control should, with due consid¬ 
eration of ethical and legal requirements, take 
all steps necessary to assure that only individuals
with a legitimate need for access to test
materials are able to obtain such access before
the test administration, and afterwards as
well, if any part of the test will be reused at a
later time. Test users must balance test security
with the rights of all test takers and test
users. When sensitive test documents are 
challenged, it may be appropriate to employ 
an independent third party, using a closely 
supervised secure procedure to conduct a 
review of the relevant materials. Such secure 
procedures are usually preferable to placing 
tests, manuals, and an examinee's test responses
in the public record.

Standard 5.8 

Test scoring services should document the 
procedures that were followed to assure 
accuracy of scoring. The frequency of scor¬ 
ing errors should be monitored and reported 
to users of the service on reasonable request. 
Any systematic source of scoring errors 
should be corrected. 

Comment: Clerical and mechanical errors 
should be examined. Scoring errors should 
be minimized and, when they are found, 
steps should be taken promptly to minimize 
their recurrence.

Standard 5.9 

When test scoring involves human judgment, 
scoring rubrics should specify criteria for scoring.
Adherence to established scoring criteria
should be monitored and checked regularly. 
Monitoring procedures should be documented. 

Comment: Human scorers may be provided 
with scoring rubrics listing acceptable alterna¬ 
tive responses, as well as general criteria. 
Consistency of scoring is often checked by 
rescoring randomly selected test responses 
and by rescoring some responses from earlier 
administrations. Periodic checks of the statis¬ 
tical properties (e.g., means, standard devia¬ 
tions) of scores assigned by individual scorers 
during a scoring session can provide feedback 
for the scorers, helping them to maintain 
scoring standards. Lack of consistent scoring 
may call for retraining or dismissing some scor¬ 
ers or for reexamining the scoring rubrics. 
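
The periodic statistical check mentioned in this comment might look like the sketch below. It is illustrative only: the function names, the scorer labels, the scores, and the tolerance of 0.5 score points are hypothetical, and the comparison is meaningful only if responses are allocated to scorers more or less at random.

```python
import statistics


def scorer_summary(scores_by_scorer):
    """Mean and standard deviation of the scores each scorer has assigned."""
    return {
        scorer: (statistics.mean(scores), statistics.stdev(scores))
        for scorer, scores in scores_by_scorer.items()
    }


def flag_scorers(scores_by_scorer, tolerance=0.5):
    """List scorers whose mean differs from the pooled mean by more than tolerance.

    The tolerance is a hypothetical program-specific value, in score points.
    """
    pooled = statistics.mean(
        [score for scores in scores_by_scorer.values() for score in scores]
    )
    summary = scorer_summary(scores_by_scorer)
    return sorted(
        scorer
        for scorer, (mean, _sd) in summary.items()
        if abs(mean - pooled) > tolerance
    )


# Hypothetical scores assigned during one scoring session.
session = {
    "scorer_1": [3, 4, 3, 4],
    "scorer_2": [3, 3, 4, 4],
    "scorer_3": [5, 5, 4, 5],
}
print(scorer_summary(session))
print(flag_scorers(session))
```

A flagged scorer is a prompt for review rather than proof of inconsistent scoring; rescoring a sample of that scorer's responses would be the natural next step.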

Standard 5.10 

When test score information is released to 
students, parents, legal representatives, teach¬ 
ers, clients, or the media, those responsible 
for testing programs should provide appro¬ 
priate interpretations. The interpretations 
should describe in simple language what the 
test covers, what scores mean, the precision 
of the scores, common misinterpretations of 
test scores, and how scores will be used. 

Comment: Test users should consult the inter¬ 
pretive material prepared by the test developer 
or publisher and should revise or supplement 
the material as necessary to present the local and 
individual results accurately and clearly. Score 
precision might be depicted by error bands, 
or likely score ranges, showing the standard 
error of measurement. 
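
An error band of the kind described here can be computed as in the following sketch. It uses the classical relation between the standard error of measurement, the score standard deviation, and the reliability coefficient; the function names and the example values (a reported score of 50, a standard deviation of 10, a reliability of 0.84) are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Classical SEM: score SD times the square root of (1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return sd * math.sqrt(1 - reliability)


def error_band(score, sd, reliability, n_sem=1):
    """Return (low, high): the score plus or minus n_sem SEMs."""
    sem = standard_error_of_measurement(sd, reliability)
    return score - n_sem * sem, score + n_sem * sem


# Hypothetical scale: SD of 10 and reliability of 0.84 give an SEM near 4.
low, high = error_band(score=50, sd=10, reliability=0.84)
print(f"Reported score 50, likely range {low:.0f} to {high:.0f}")
```

A single overall SEM is a simplification; where conditional standard errors are available for different score levels, a band built from them may describe precision more accurately.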

Standard 5.11 

When computer-prepared interpretations of 
test response protocols are reported, the 
sources, rationale, and empirical basis for 
these interpretations should be available, 
and their limitations should be described. 


Comment: Whereas computer-prepared inter¬ 
pretations may be based on expert judgment, 
the interpretations are of necessity based 
on accumulated experience and may not be 
able to take into consideration the context of 
the individual’s circumstances. Computer- 
prepared interpretations should be used with 
care in diagnostic settings, because they 
may not take into account other information 
about the individual test taker, such as age,
gender, education, prior employment, and 
medical history, that provide context for 
test results. 

Standard 5.12 

When group-level information is obtained 
by aggregating the results of partial tests 
taken by individuals, validity and reliability 
should be reported for the level of aggrega¬ 
tion at which results are reported. Scores 
should not be reported for individuals unless 
the validity, comparability, and reliability of 
such scores have been established. 

Comment: Large-scale assessments often 
achieve efficiency by “matrix sampling” of 
the content domain by asking different test
takers different questions. The testing then 
requires less time from each test taker, while 
the aggregation of individual results provides 
for domain coverage that can be adequate 
for meaningful group- or program-level 
interpretations, such as schools, or grade 
levels within a locality or particular subject- 
matter areas. Because the individual receives 
only an incomplete test, an individual score 
would have limited meaning. If individual
scores are provided, comparisons between
scores obtained by different individuals are 
based on responses to items that may cover 
different material. Some degree of calibra¬ 
tion among incomplete tests can sometimes 
be made. Such calibration is essential to the 
comparisons of individual scores.

Standard 5.13

Transmission of individually identified test 
scores to authorized individuals or institu¬ 
tions should be done in a manner that pro¬ 
tects the confidential nature of the scores. 

Comment: Care is always needed when com¬ 
municating the scores of identified test takers, 
regardless of the form of communication. 
Face-to-face communication, as well as telephone
and written communication, present
well-known problems. Transmission by electronic
media, including computer networks
and facsimile, presents modern challenges 
to confidentiality. 

Standard 5.14 

When a material error is found in test scores 
or other important information released by a 
testing organization or other institution, a 
corrected score report should be distributed 
as soon as practicable to all known recipients 
who might otherwise use the erroneous scores 
as a basis for decision making. The corrected 
report should be labeled as such. 

Comment: A material error is one that could 
change the interpretation of the test score. 
Innocuous typographical errors would be 
excluded. Timeliness is essential for decisions 
that will be made soon after the test scores 
are received. 

Standard 5.15 

When test data about a person are retained, 
both the test protocol and any written 
report should also be preserved in some 
form. Test users should adhere to the poli¬ 
cies and record-keeping practice of their 
professional organizations. 

Comment: The protocol may be needed to 
respond to a possible challenge from a test 
taker. The protocol would ordinarily be
accompanied by testing materials and test
scores. Retention of more detailed records of
responses would depend on circumstances 
and should be covered in a retention policy 
(see the following standard). Record keeping 
may be subject to legal and professional 
requirements. Policy for the release of any test 
information for other than research purposes 
is discussed in chapter 8. 

Standard 5.16 

Organizations that maintain test scores on 
individuals in data files or in an individual’s 
records should develop a clear set of policy 
guidelines on the duration of retention of an 
individual’s records, and on the availability, 
and use over time, of such data. 

Comment: In some instances, test scores 
become obsolete over time, no longer
reflecting the current state of the test taker. 
Outdated scores should generally not be used 
or made available, except for research purpos¬ 
es. In other cases, test scores obtained in past 
years can be useful as, for example, in longi¬ 
tudinal assessment. The key issue is the valid 
use of the information. Score retention and 
disclosure may be subject to legal and professional
requirements.

6. SUPPORTING DOCUMENTATION
FOR TESTS 


Background 

The provision of supporting documents for 
tests is the primary means by which test 
developers, publishers, and distributors com¬ 
municate with test users. These documents 
are evaluated on the basis of their complete¬ 
ness, accuracy, currency, and clarity and 
should be available to qualified individuals as 
appropriate. A test's documentation typically
specifies the nature of the test; its intended
use; the processes involved in the test’s development;
technical information related to
scoring, interpretation, and evidence of validity
and reliability; scaling and norming if
appropriate to the instrument; and guidelines
for test administration and interpretation.
The objective of the documentation is to provide
test users with the information needed to
make sound judgments about the nature and 
quality of the test, the resulting scores, and 
the interpretations based on the test scores. 
The information may be reported in docu¬ 
ments such as test manuals, technical manu¬ 
als, user’s guides, specimen sets, examination 
kits, directions for test administrators and 
scorers, or preview materials for test takers. 

Test documentation is most effective if it 
communicates information to multiple user 
groups. To accommodate the breadth of
training of professionals who use tests, sepa¬ 
rate documents or sections of documents may 
be written for identifiable categories of users 
such as practitioners, consultants, administra¬ 
tors, researchers, and educators. For example, 
the test user who administers the tests and
interprets the results needs interpretive infor¬ 
mation or guidelines. On the other hand, 
those who are responsible for selecting tests 
need to be able to judge the technical adequacy
of the test. Therefore, some combination
of technical manuals, user's guides, test manuals,
test supplements, examination kits, or
specimen sets ordinarily is published to provide
a potential test user or test reviewer with
sufficient information to evaluate the appropriateness
and technical adequacy of the test.
The types of information presented in these 
documents typically include a description of 
the intended test-taking population, stated 
purpose of the test, test specifications, item
formats, scoring procedures, and the test 
development process. Technical data, such as 
psychometric indices of the items, reliability 
and validity evidence, normative data, and 
cut scores or configural rules including those 
for computer-generated interpretations of test 
scores also are summarized. 

An essential feature of the documentation 
for every test is a discussion of the known 
appropriate and inappropriate uses and inter¬ 
pretations of the test scores. The inclusion of 
illustrations of score interpretations, as they 
relate to the test developer's intended applica¬ 
tions, also will help users make accurate infer¬ 
ences on the basis of the test scores. When 
possible, illustrations of improper test uses and 
inappropriate test score interpretations will 
help guard against the misuse of the test.

Test documents need to include enough 
information to allow test users and reviewers 
to determine the appropriateness of the test
for its intended purposes. References to other 
materials that provide more details about 
research by the publisher or independent 
investigators should be cited and should be 
readily obtainable by the test user or reviewer. 
This supplemental material can be provided 
in any of a variety of published or unpub¬ 
lished forms; when demand is likely to be 
low, it may be maintained in archival form, 
including electronic storage. Test documenta¬ 
tion is useful for all test instruments, includ¬ 
ing those that are developed exclusively for 
use within a single organization.

In addition to technical documentation,
descriptive materials are needed in some set¬ 
tings to inform examinees and other interested 
parties about the nature and content of the 
test. The amount and type of information 
will depend on the particular test and appli¬ 
cation. For example, in situations requiring 
informed consent, information should be suf¬ 
ficient to develop a reasoned judgment. Such 
information should be phrased in nontechni¬ 
cal language and should be as inclusive as is 
consistent with the use of the test scores. The 
materials may include a general description 
and rationale for the test; sample items or 
complete sample tests; and information about 
conditions of test administration, confiden¬ 
tiality, and retention of test results. For some 
applications, however, the true nature and 
purpose of a test are purposely hidden or dis¬ 
guised to prevent faking or response bias. In 
these instances, examinees may be motivated 
to reveal more or less of the characteristics 
intended to be assessed. Under these circum¬ 
stances, hiding or disguising the true nature 
or purpose of the test is acceptable provided 
this action is consistent with legal principles 
and ethical standards. 

This chapter provides general standards 
for the preparation and publication of test 
documentation. The other chapters contain 
specific standards that will be useful to test 
developers, publishers, and distributors in the 
preparation of materials to be included in a 
test's documentation.


Standard 6.1 

Test documents (e.g., test manuals, technical 
manuals, user's guides, and supplemental 
material) should be made available to prospec¬ 
tive test users and other qualified persons at 
the time a test is published or released for use. 

Comment: The test developer or publisher
should judge carefully which information 
should be included in first editions of the test 
manual, technical manual, or user's guides 
and which information can be provided in 
supplements. For low-volume, unpublished 
tests, the documentation may be relatively brief.
When the developer is also the user, docu¬ 
mentation and summaries are still necessary. 

Standard 6.2 

Test documents should be complete, accurate,
and clearly written so that the intended
reader can readily understand the content. 

Comment: Test documents should provide 
sufficient detail to permit reviewers and 
researchers to judge or replicate important 
analyses published in the test manual. For
example, reporting correlation matrices in 
the test document may allow the test user 
to judge the data upon which decisions and 
conclusions were based, or describing in 
detail the sample and the nature of any factor 
analyses that were conducted will allow the 
test user to replicate reported studies. 

Standard 6.3 

The rationale for the test, recommended 
uses of the test, support for such uses, and 
information that assists in score interpreta¬ 
tion should be documented. Where particu¬ 
lar misuses of a test can be reasonably 
anticipated, cautions against such misuses 
should be specified. 

Comment: Test publishers make every effort
to caution test users against known misuses of
tests. However, test publishers are not required
to anticipate all possible misuses of a test. If 
publishers do know of persistent test misuse 
by a test user, extraordinary educational 
efforts may be appropriate. 

Standard 6.4 

The population for whom the test is intended 
and the test specifications should be docu¬ 
mented. If applicable, the item pool and scale 
development procedures should be described 
in the relevant test manuals. If normative data
are provided, the norming population should 
be described in terms of relevant demographic 
variables, and the year(s) in which the data 
were collected should be reported. 

Comment: Known limitations of a test for cer¬ 
tain populations also should be clearly delin¬ 
eated in the test documents. In addition, if 
the test is available in more than one language, 
test documents should provide information 
on the translation or adaptation procedures, 
on the demographics of each norming sample, 
and on score interpretation issues for each language
into which the test has been translated.

Standard 6.5 

When statistical descriptions and analyses 
that provide evidence of the reliability of 
scores and the validity of their recommended 
interpretations are available, the information 
should be included in the test’s documenta¬ 
tion. When relevant for test interpretation, 
test documents ordinarily should include 
item level information, cut scores and configural
rules, information about raw scores
and derived scores, normative data, the stan¬ 
dard errors of measurement, and a descrip¬ 
tion of the procedures used to equate 
multiple forms. 

Standard 6.6 

When a test relates to a course of training or 
study, a curriculum, a textbook, or packaged
instruction, the documentation should include
an identification and description of the course 
or instructional materials and should indicate 
the year in which these materials were prepared. 

Standard 6.7 

Test documents should specify qualifications 
that are required to administer a test and to 
interpret the test scores accurately. 

Comment: Statements of user qualifications 
need to specify the training, certification, 
competencies, or experience needed to have 
access to a test. 

Standard 6.8 

If a test is designed to be scored or interpre¬ 
ted by test takers, the publisher and test 
developer should provide evidence that the 
test can be accurately scored or interpreted 
by the test taker. Tests that are designed to
be scored and interpreted by the test taker 
should be accompanied by interpretive 
materials that assist the individual in under¬ 
standing the test scores and that are written 
in language that the test taker can understand. 

Standard 6.9 

Test documents should cite a representative 
set of the available studies pertaining to gen¬ 
eral and specific uses of the test. 

Comment: Summaries of cited studies—exclud¬ 
ing published works, dissertations, or propri¬ 
etary documents—should be made available 
on request to test users and researchers by the
publisher. 

Standard 6.10 

Interpretive materials for tests that include
case studies should provide examples illustrating
the diversity of prospective test takers.

Comment: For some instruments, the presentation
of case studies that are intended to
assist the user in the interpretation of the test
scores and profiles also will be appropriate for 
inclusion in the test documentation. For 
example, case studies might cite as appropri¬ 
ate examples of women and men of different 
ages; individuals differing in sexual orienta¬ 
tion; persons representing various ethnic, cul¬ 
tural, or racial groups; and individuals with 
special needs. The inclusion of examples illus¬ 
trating the diversity of prospective test takers 
is not intended to promote interpretation of 
test scores in a manner inconsistent with legal 
requirements that may restrict certain practices 
in some contexts, such as employee selection. 

Standard 6.11 

If a test is designed so that more than one
method can be used for administration or 
for recording responses (such as marking
responses in a test booklet, on a separate
answer sheet, or on a computer keyboard),
then the manual should clearly document the
extent to which scores arising from these 
methods are interchangeable. If the results 
are not interchangeable, this fact should be 
reported, and guidance should be given for 
the interpretation of scores obtained under 
the various conditions or methods of 
administration. 

Standard 6.12 

Publishers and scoring services that offer 
computer-generated interpretations of test 
scores should provide a summary of the evidence
supporting the interpretations given.

Comment: The test user should be informed 
of any cut scores or configural rules necessary 
for understanding computer-generated score 
interpretations. A description of both the samples
used to derive cut scores or configural rules
and the methods used to derive the cut scores 
should be provided. When proprietary inter¬ 
ests result in the withholding of cut scores or 
configural rules, the owners of the intellectual
property are responsible for documenting evidence
in support of the validity of computer-generated
score interpretations. Such evidence
might be provided, for example, by reporting
the finding of an independent review of the
algorithms by qualified professionals.

Standard 6.13 

When substantial changes are made to a 
test, the test's documentation should be 
amended, supplemented, or revised to keep 
information for users current and to provide 
useful additional information or cautions. 

Standard 6.14 

Every test form and supporting document 
should carry a copyright date or publication 
date. 

Comment: During the operational life of a test,
new or revised test forms may be published, 
and manuals and other materials may be 
added or revised. Users and potential users 
are entitled to know the publication dates of 
various documents that include test norms. 
Communication among researchers is ham¬ 
pered when the particular test documents 
used in experimental studies are ambiguously 
referenced in research reports. 

Standard 6.15 

Test developers, publishers, and distributors 
should provide general information for test
users and researchers who may be required 
to determine the appropriateness of an 
intended test use in a specific context. When 
a particular test use cannot be justified, the 
response to an inquiry from a prospective test 
user should indicate this fact clearly. General
information also should be provided for test 
takers and legal guardians who must provide 
consent prior to a test's administration.

7. FAIRNESS IN TESTING AND
TEST USE 


Background 

This chapter addresses overriding issues of 
fairness in testing. It is intended both to 
emphasize the importance of fairness in all 
aspects of testing and assessment and to serve 
as a context for the technical standards. Later
chapters address in greater detail some fairness 
issues involving the responsibilities of test 
users, the rights and responsibilities of test 
takers, the testing of individuals of diverse lin¬ 
guistic backgrounds, and the testing of those 
with disabilities. Chapters 12 through 15 also 
address some fairness issues specific to psycho¬ 
logical, educational, employment and creden- 
tialing, and program evaluation applications 
of testing and assessment. 

Concern for fairness in testing is perva¬ 
sive, and the treatment accorded the topic 
here cannot do justice to the complex issues 
involved. A full consideration of fairness 
would explore the many functions of testing
in relation to its many goals, including the 
broad goal of achieving equality of opportu¬ 
nity in our society. It would consider the 
technical properties of tests, the ways test 
results are reported, and the factors that are 
validly or erroneously thought to account 
for patterns of test performance for groups 
and individuals. A comprehensive analysis 
would also examine the regulations, statutes, 
and case law that govern test use and the 
remedies for harmful practices. The Standards 
cannot hope to deal adequately with all these 
broad issues, some of which have occasioned 
sharp disagreement among specialists and 
other thoughtful observers. Rather, the focus 
of the Standards is on those aspects of tests, 
testing, and test use that are the customary 
responsibilities of those who make, use, 
and interpret tests, and that are character¬ 
ized by some measure of professional and 
technical consensus. 


Absolute fairness to every examinee is 
impossible to attain, if for no other reasons 
than the facts that tests have imperfect reliability
and that validity in any particular context
is a matter of degree. But neither is any
alternative selection or evaluation mechanism 
perfectly fair. Properly designed and used, 
tests can and do further societal goals of fair¬ 
ness and equality of opportunity. Serious 
technical deficiencies in test design, use, or 
interpretation should, of course, be addressed, 
but the fairness of testing in any given con¬ 
text must be judged relative to that of feasible 
test and nontest alternatives. It is general
practice that large-scale tests are subjected to 
careful review and empirical checks to mini¬ 
mize bias. The amount of explicit attention to 
fairness in the design of well-made tests com¬ 
pares favorably to that of many alternative 
selection or evaluation methods. 

It is also crucial to bear in mind that test 
settings are interpersonal. The interaction of 
examiner with examinee should be profes¬ 
sional, courteous, caring, and respectful. In 
most testing situations, the roles of examiner
and examinee are sharply unequal in status. A 
professional’s inferences and reports from test 
findings may markedly impact the life of the 
person who is examined. Attention to these 
aspects of test use and interpretation is no less 
important than more technical concerns.

As is emphasized in professional educa¬ 
tion and training, users of tests should be 
alert to the possibility that human issues 
involving examiner and examinee may some¬ 
times affect test fairness. Attention to inter¬ 
personal issues is always important, perhaps 
especially so when examinees have a disability 
or differ from the examiner in ethnic, racial, 
or religious background; in gender or sexual 
orientation; in socioeconomic status; in age;
or in other respects that may affect the
examinee-examiner interaction.

Varying Views of Fairness

The term fairness is used in many different ways 
and has no single technical meaning. It is pos¬ 
sible that two individuals may endorse fairness 
in testing as a desirable social goal, yet reach 
quite different conclusions about the fairness 
of a given testing program. Outlined below are 
four principal ways in which the term fairness 
is used. It should be noted, however, that 
many additional interpretations may be found 
in the technical and popular literature.

The first two characterizations presented 
here relate fairness to absence of bias and to 
equitable treatment of all examinees in the 
testing process. There is broad consensus that 
tests should be free from bias (as defined 
below) and that all examinees should be treated
fairly in the testing process itself (e.g.,
afforded the same or comparable procedures in 
testing, test scoring, and use of scores). The 
third characterization of test fairness addresses 
the equality of testing outcomes for examinee 
subgroups defined by race, ethnicity, gender, 
disability, or other characteristics. The idea that 
fairness requires equality in overall passing 
rates for different groups has been almost 
entirely repudiated in the professional testing 
literature. A more widely accepted view would 
hold that examinees of equal standing with 
respect to the construct the test is intended to 
measure should on average earn the same test 
score, irrespective of group membership. 
Unfortunately, because examinees' levels of 
the construct are measured imperfectly, this 
requirement is rarely amenable to direct exami¬ 
nation. The fourth definition of fairness relates 
to equity in opportunity to learn the material 
covered in an achievement test. There would 
be general agreement that adequate opportuni¬ 
ty to learn is clearly relevant to some uses and 
interpretations of achievement tests and clearly 
irrelevant to others, although disagreement might 
arise as to the relevance of opportunity to learn 
to test fairness in some specific situations. 


Fairness as Lack of Bias

Bias is used here as a technical term. It is
said to arise when deficiencies in a test itself
or the manner in which it is used result in
different meanings for scores earned by mem¬ 
bers of different identifiable subgroups. When 
evidence of such deficiencies is found at the 
level of item response patterns for members 
of different groups, the terms item bias or dif¬ 
ferential item functioning (DIF) are often used. 
When evidence is found by comparing the 
patterns of association for different groups 
between test scores and other variables, the 
term predictive bias may be used. The concept 
of bias and techniques for its detection are
discussed below and are also discussed in 
other chapters of the Standards. There is 
general consensus that consideration of bias 
is critical to sound testing practice. 
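
As an illustration of how evidence at the level of item response patterns might be examined, the following Python sketch computes a Mantel-Haenszel common odds ratio for a single item, comparing a reference group and a focal group after examinees have been matched on a score such as the total test score. The Mantel-Haenszel procedure is only one of several DIF techniques and is not prescribed by the Standards; the function names, the data layout, and the counts are hypothetical.

```python
from collections import defaultdict


def mantel_haenszel_odds_ratio(records):
    """Return the Mantel-Haenszel common odds ratio for one item.

    records: iterable of (group, matched_score, correct) tuples, where
    group is "ref" or "focal", matched_score is the matching variable
    (for example, total test score), and correct is True or False.
    A value near 1.0 suggests little DIF; values far from 1.0 suggest
    that the item favors one group among examinees of comparable
    overall standing.
    """
    # One 2x2 table per score level: tables[level][group] = [right, wrong]
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for group, level, correct in records:
        tables[level][group][0 if correct else 1] += 1

    numerator = 0.0
    denominator = 0.0
    for cells in tables.values():
        ref_right, ref_wrong = cells["ref"]
        focal_right, focal_wrong = cells["focal"]
        n = ref_right + ref_wrong + focal_right + focal_wrong
        numerator += ref_right * focal_wrong / n
        denominator += ref_wrong * focal_right / n
    if denominator == 0:
        raise ValueError("odds ratio is undefined for these data")
    return numerator / denominator


def expand(group, level, n_right, n_wrong):
    """Build hypothetical response records from counts."""
    return ([(group, level, True)] * n_right
            + [(group, level, False)] * n_wrong)


# Hypothetical counts at two matched score levels.
sample = (
    expand("ref", 10, 30, 10) + expand("focal", 10, 20, 20)
    + expand("ref", 20, 40, 10) + expand("focal", 20, 30, 20)
)
ratio = mantel_haenszel_odds_ratio(sample)
```

A ratio well above 1.0 for these invented counts would indicate that, among matched examinees, the reference group answered the item correctly more often. Such a statistical flag is a prompt for review of the item, not proof of bias.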

Fairness as Equitable Treatment in the Testing 
Process 

There is consensus that just treatment 
throughout the testing process is a necessary 
condition for test fairness. There is also con¬ 
sensus that fair treatment of all examinees 
requires consideration not only of a test itself, 
but also the context and purpose of testing 
and the manner in which test scores are used. 
A well-designed test is not intrinsically fair or 
unfair, but the use of the test in a particular 
circumstance or with particular examinees 
may be fair or unfair. Unfairness can have 
individual and collective consequences. 

Regardless of the purpose of testing, fairness
requires that all examinees be given a
comparable opportunity to demonstrate
their standing on the construct(s) the test is
intended to measure. Just treatment also
includes such factors as appropriate testing
conditions and equal opportunity to become
familiar with the test format, practice materials,
and so forth. In situations where individual
or group test results are reported, just
treatment also implies that such reporting
should be accurate and fully informative.

Fairness also requires that all examinees
be afforded appropriate testing conditions.
Careful standardization of tests and admin¬ 
istration conditions generally helps to assure 
that examinees have comparable opportuni¬ 
ty to demonstrate the abilities or attributes 
to be measured. In some cases, however, 
aspects of the testing process that pose no 
particular challenge for most examinees may 
prevent specific groups or individuals from 
accurately demonstrating their standing 
with respect to the construct of interest 
(e.g., due to disability or language background).
In some instances, greater comparability
may be attained if
standardized procedures are modified. There 
are contexts in which some such modifica¬ 
tions are forbidden by law and other con¬ 
texts in which some such modifications are 
required by law. In all cases, standardized 
procedures should be followed for all exam¬ 
inees unless explicit, documented accommo¬ 
dations have been made. 

Ideally, examinees would also be afford¬ 
ed equal opportunity to prepare for a test. 
Examinees should in any case be afforded 
equal access to materials provided by the 
testing organization and sponsor which 
describe the test content and purpose and 
offer specific familiarization and preparation 
for test taking. In addition to assuring equi¬ 
ty in access to accepted resources for test 
preparation, this principle covers test securi¬ 
ty for nondisclosed tests. If some examinees 
were to have prior access to the contents of 
a secure test, for example, basing decisions 
upon the relative performance of different 
examinees would be unfair to others who 
did not have such access. On tests that have
important individual consequences, all exam¬ 
inees should have a meaningful opportunity 
to provide input to relevant decision makers 
if procedural irregularities in testing are 
alleged, if the validity of the individual's 
score is challenged or may not be reported, 
or if similar special circumstances arise.


Finally, the conception of fairness as 
equitable treatment in the testing process 
extends to the reporting of individual and 
group test results. Individual test score information
is entitled to confidential treatment in
most circumstances. Confidentiality should 
be respected; scores should be disclosed only 
as appropriate. When test scores are reported, 
either for groups or individuals, score reports 
should be accurate and informative. It may 
be especially important when reporting 
results to nonprofessional audiences to use
appropriate language and wording and to 
try to design reports to reduce the likelihood 
of inappropriate interpretations. When group 
achievement differences are reported, for 
example, including additional information to 
help the intended audience understand con¬ 
founding factors such as unequal educational 
opportunity may help to reduce misinterpre¬ 
tation of test results and increase the likeli¬ 
hood that tests will be used wisely. 

Fairness as Equality in Outcomes of Testing

The idea that fairness requires overall 
passing rates to be comparable across groups 
is not generally accepted in the professional 
literature. Most testing professionals would 
probably agree that while group differences in 
testing outcomes should in many cases trigger 
heightened scrutiny for possible sources of 
test bias, outcome differences across groups 
do not in themselves indicate that a testing
application is biased or unfair. It might be 
argued that when tests are used for selection, 
persons who all would perform equally well 
on the criterion measure if selected should 
have an equal chance of being chosen regard¬ 
less of group membership. Unfortunately, 
there is rarely any direct procedure for deter¬ 
mining whether this ideal has been met. 
Moreover, if score distributions differ from 
one group to another, it is generally impossible
to satisfy this ideal using any test that has
a less than perfect correlation with the criterion
measure.

Many testing professionals would agree
that if a test is free of bias and examinees 
have received fair treatment in the testing 
process, then the conditions of fairness have 
been met. That is, given evidence of the 
validity of intended test uses and interpreta¬ 
tions, including evidence of lack of bias and 
attention to issues of fair treatment, fairness
has been established regardless of group-level 
outcomes. This view need not imply that 
unequal testing outcomes should be ignored 
altogether. They may be important in gener¬ 
ating new hypotheses about bias and fair 
treatment. But in this view, unequal out¬ 
comes at the group level have no direct bear¬ 
ing on questions of test fairness. There may 
be legal requirements to investigate certain 
differences in outcomes of testing among sub¬ 
groups. Those requirements further may pro¬ 
vide that, other things being equal, a testing 
alternative that minimizes outcome differ¬ 
ences across relevant subgroups should be 
used. The standards in this chapter are 
intended to be applied in a manner consistent 
with legal and regulatory standards. 
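
The observation made earlier in this section, that equal selection chances for equally capable persons generally cannot be achieved when score distributions differ and the test correlates less than perfectly with the criterion, can be illustrated with a small calculation. The sketch below assumes a stylized model that is not taken from the Standards: within each group, test scores are normally distributed with unit variance, and the same regression of criterion on test score holds in every group. The function name, parameter names, and numerical values are hypothetical.

```python
from statistics import NormalDist


def selection_probability(group_mean, criterion, cut_score, validity):
    """P(test score > cut_score) for a person with a given criterion value.

    Assumed model: within a group the test score X is normal with mean
    group_mean and variance 1, and the criterion is
    Y = validity * X + sqrt(1 - validity**2) * noise,
    with standard normal noise independent of X. The regression of Y
    on X is therefore identical in every group.
    """
    if not 0 < validity < 1:
        raise ValueError("validity must be strictly between 0 and 1")
    residual_var = 1 - validity ** 2
    # Conditional distribution of X given Y = criterion in this group.
    cond_mean = group_mean * residual_var + validity * criterion
    cond_sd = residual_var ** 0.5
    return 1 - NormalDist(cond_mean, cond_sd).cdf(cut_score)


# Two persons with the same criterion performance, drawn from groups
# whose test score means differ by half a standard deviation.
p_higher_mean_group = selection_probability(0.0, 0.5, cut_score=0.5, validity=0.6)
p_lower_mean_group = selection_probability(-0.5, 0.5, cut_score=0.5, validity=0.6)
```

Under these assumptions the person from the group with the higher mean has the greater chance of exceeding the cut score, even though both would perform equally well on the criterion. The gap narrows as the validity coefficient approaches 1 and vanishes when the group means are equal.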

Fairness as Opportunity to Learn 

This final conception of fairness arises in 
connection with educational achievement test¬ 
ing. In many contexts, achievement tests are 
intended to assess what a test taker knows or 
can do as a result of formal instruction. When 
some test takers have not had the opportunity 
to learn the subject matter covered by the test 
content, they are likely to get low scores. The 
test score may accurately reflect what the test 
taker knows and can do, but low scores may 
have resulted in part from not having had the 
opportunity to learn the material tested as well 
as from having had the opportunity and having 
failed to learn. When test takers have not had 
the opportunity to learn the material tested, the 
policy of using their test scores as a basis for 
withholding a high school diploma, for exam¬ 
ple, is viewed as unfair. This issue is further dis¬ 
cussed in chapter 13, on educational testing. 


At least three important difficulties arise 
with this conception of fairness. First, the 
definition of opportunity to learn is difficult in 
practice, especially at the level of individuals. 
Opportunity is a matter of degree. Moreover,
the measurement of some important learning 
outcomes may require students to work with 
material they have not seen before. Second, 
even if it is possible to document the topics 
included in the curriculum for a group of students,
specific content coverage for any one
student may be impossible to determine. 
Finally, there is a well-founded desire to 
assure that credentials attest to certain profi¬ 
ciencies or capabilities. Granting a diploma to 
a low-scoring examinee on the grounds that 
the student had insufficient opportunity to 
learn the material tested means certificating 
someone who has not attained the degree of 
proficiency the diploma is intended to signify. 

It should be noted that opportunity to 
learn ordinarily plays no role in determining 
the fairness of tests used for employment and 
credentialing, which are covered in chapter
14, nor of admissions testing. In those cir¬ 
cumstances, it is deemed fair that the test 
should cover the full range of requisite 
knowledge and skills. However, there are situ¬ 
ations in which the agency that determines 
the contents of a test used for employment or 
credentialing also sets the curriculum that 
must be followed in preparing to take the 
test. In such cases, it is the responsibility of 
that agency to assure that what is to be tested 
is fully included in the specification of what 
is to be taught. 

Bias Associated With Test Content 
and Response Processes 

The term bias in tests and testing refers to 
construct-irrelevant components that result 
in systematically lower or higher scores for 
identifiable groups of examinees. Such construct-irrelevant
score components may be
introduced due to inappropriate sampling of
test content or lack of clarity in test instructions.
They may also arise if scoring criteria
fail to credit fully some correct problem 
approaches or solutions that are more typi¬ 
cal of one group than another. Evidence of 
these potential sources of bias may be
sought in the content of the tests, in comparisons
of the internal structure of test
responses for different groups, and in com¬ 
parisons of the relationships of test scores 
to other measures, although none of these 
types of evidence is unequivocal. 
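
One way to compare the relationships of test scores to other measures across groups is to fit a separate regression line for each group and inspect the slopes and intercepts. The Python sketch below is a minimal illustration under that assumption; the function names and data are hypothetical, and in practice differences would need to be evaluated against sampling error before any conclusion is drawn.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def compare_group_regressions(data):
    """Fit one line per group.

    data maps a group label to (test_scores, criterion_scores).
    Returns a dict mapping each label to (slope, intercept).
    """
    return {group: fit_line(xs, ys) for group, (xs, ys) in data.items()}


# Hypothetical data: equal slopes but different intercepts, so the same
# test score corresponds to different predicted criterion values.
lines = compare_group_regressions({
    "group_a": ([1, 2, 3, 4], [2, 4, 6, 8]),
    "group_b": ([1, 2, 3, 4], [3, 5, 7, 9]),
})
```

In this invented example a single regression line fitted to the pooled data would underpredict criterion performance for group_b and overpredict it for group_a. As the text notes, evidence of this kind is not unequivocal, because the criterion measure may itself be imperfect.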

Content-Related Sources of Test Bias

Bias due to inappropriate selection of 
test content may sometimes be detected by 
inspection of the test itself. In some testing 
contexts, it is common for test developers to 
engage an independent panel of diverse 
experts to review test content for language 
that might be interpreted differently by mem¬ 
bers of different groups and for material that 
might be offensive or emotionally disturbing 
to some test takers. For performance assess¬ 
ments, panels are often engaged to review 
the scoring rubric as well. A test intended to
measure verbal analogical reasoning, for 
example, should include words in general use, 
not words and expressions associated with 
particular disciplines, occupations, ethnic 
groups, or locations. Where material likely 
to be differentially interesting or relevant to 
some examinees is included, it may be bal¬ 
anced by material that may be of particular 
interest to the remaining examinees. 

In educational achievement testing, 
alignment with curriculum may bear on
questions of content-related test bias. One may
ask how well a test represents some content 
domain and also whether that domain is 
appropriate given intended score interpreta¬ 
tions. A test of 19th-century United States 
history might give considerable emphasis to 
the War of 1812, the Mexican War, the Civil 
War, and the Spanish-American War. If some
state's curriculum framework dealt relatively
lightly with these wars, devoting more
attention instead, say, to social and industrial
developments, then that state's test takers
might be relatively disadvantaged.

Bias may also result from a lack of clarity 
in test instructions or from scoring rubrics 
that credit responses more typical of one 
group than another. For example, cognitive 
ability tests often require test takers to classify
objects according to an unspecified rule. If a 
given task credits classification on the basis of 
the stimulus objects' functions, but an identi¬ 
fiable subgroup of examinees tends to classify 
the objects on the basis of their physical 
appearance, faulty test interpretations are 
likely. Similarly, if the scoring rubric for a 
constructed response item reserves the highest 
score level for those examinees who in fact 
provide more information or elaboration than 
was actually requested, then less test-wise 
examinees who simply follow instructions will 
earn lower scores. In this case, test-wiseness
becomes a construct-irrelevant component 
of test scores. 

Judgmental methods for the review of
tests and test items are often supplemented by 
statistical procedures for identifying items on 
tests that function differently across identifi¬ 
able subgroups of examinees. Differential 
item functioning (DIF) is said to exist when 
examinees of equal ability differ on average, 
according to their group membership, in their 
responses to a particular item. If examinees 
from each group are divided into subgroups 
according to the tested ability and subgroups
at the same ability level have unequal proba¬ 
bilities of answering a given item correctly, 
then there is evidence that that item may not 
be functioning as intended. It may be meas¬ 
uring something different from the remainder 
of the test or it may be measuring with differ¬ 
ent levels of precision for different subgroups 
of examinees. Such an item may offer a valid 
measurement of some narrow element of the 
intended construct, or it may tap some
construct-irrelevant component that advantages
or disadvantages members of one group.
Although DIF procedures may hold some 
promise for improving test quality, there has 
been little progress in identifying the causes 
or substantive themes that characterize items 
exhibiting DIF. That is, once items on a test 
have been statistically identified as function¬ 
ing differently from one examinee group to 
another, it has been difficult to specify the 
reasons for the differential performance or 
to identify a common deficiency among the 
identified items. 
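
One way to make the definition of DIF above concrete is a stratified
comparison of item responses. The sketch below is illustrative only: it
uses the Mantel-Haenszel common odds ratio, a widely used procedure that
the text itself does not name, and the function names, the group labels
"ref" and "focal", and all counts are hypothetical. A value near 1
suggests that examinees matched on ability have similar odds of answering
the item correctly; a value far from 1 flags the item for review rather
than establishing bias.

```python
from collections import defaultdict


def expand(level, group, n_correct, n_incorrect):
    """Turn summary counts into (ability_level, group, correct) records."""
    return ([(level, group, True)] * n_correct
            + [(level, group, False)] * n_incorrect)


def mantel_haenszel_odds_ratio(records):
    """Common odds ratio for one item across ability strata.

    records: iterable of (ability_level, group, correct), where group is
    "ref" or "focal" (hypothetical labels) and correct is a bool.
    """
    # One 2x2 table per ability level: [n_correct, n_incorrect] per group.
    tables = defaultdict(lambda: {"ref": [0, 0], "focal": [0, 0]})
    for level, group, correct in records:
        tables[level][group][0 if correct else 1] += 1

    numerator = denominator = 0.0
    for table in tables.values():
        a, b = table["ref"]      # reference group: correct, incorrect
        c, d = table["focal"]    # focal group: correct, incorrect
        n = a + b + c + d
        numerator += a * d / n
        denominator += b * c / n
    if denominator == 0:
        raise ValueError("odds ratio undefined for these counts")
    return numerator / denominator


# Hypothetical item on which matched focal-group examinees do worse.
flagged_item = (expand("low", "ref", 40, 60) + expand("low", "focal", 20, 80)
                + expand("high", "ref", 80, 20) + expand("high", "focal", 60, 40))
# Hypothetical item on which matched examinees perform alike.
clean_item = (expand("low", "ref", 40, 60) + expand("low", "focal", 40, 60)
              + expand("high", "ref", 80, 20) + expand("high", "focal", 80, 20))

flagged_ratio = mantel_haenszel_odds_ratio(flagged_item)
clean_ratio = mantel_haenszel_odds_ratio(clean_item)
print(flagged_ratio, clean_ratio)
```

As the surrounding discussion cautions, a flagged item still calls for
judgmental review, since the statistic does not by itself explain why
the groups differ.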

Response-Related Sources of Test Bias 

In some cases, construct-irrelevant score 
components may arise because test items elic¬ 
it varieties of responses other than those 
intended or can be solved in ways that were 
not intended. For example, clients responding
to a diagnostic inventory may attempt to
provide the answers they think the test
administrator expects as opposed to the answers that
best describe themselves. To the extent that 
such response acquiescence is more typical 
of some groups than others, bias may result. 
Bias may also be associated with test response 
formats that pose particular difficulties for 
one group or another. For example, test per¬ 
formance may rely on some capability (e.g., 
English language proficiency or fine-motor 
coordination) that is irrelevant to the intent 
of the measurement but nonetheless poses 
impediments for some examinees. A test of 
quantitative reasoning that makes inappropri¬ 
ately heavy demands on verbal ability would 
probably be biased against examinees whose 
first language is other than that of the test. 

In addition to content reviews and DIF 
analyses, evidence of bias related to response 
processes may be provided by comparisons of 
the internal structure of the test responses for 
different groups of examinees. If an analysis 
of the factors or dimensions underlying test 
performance reveals different internal
structures for different groups, it may be that
different constructs are being measured or it
may simply be that groups differ in their
variability with respect to the same underlying
dimensions. When there is evidence that 
tests, including personality tests, measure dif¬ 
ferent constructs in different gender, racial, or 
cultural groups, it is important to determine 
that the internal structure of the test supports
inferences made for clients from these distinct
subgroups of the client population. In situa¬ 
tions where internal test structure varies 
markedly across ethnically diverse cultures, it 
may be inappropriate to make direct compar¬ 
isons of scores of members of these different 
cultural groups. 

Bias may also be indicated by patterns 
of association between test scores and other 
variables. Perhaps the most familiar form 
such evidence may take is a difference across 
groups in the regression equations relating 
selection test performance to criterion per¬ 
formance. This case is discussed at greater 
length in the following section. However, 
evidence of bias based on relations to other 
variables may also take many other forms. 
The relationship between two tests of the 
same cognitive ability might be found to dif¬ 
fer from one group to another, for example. 
Such a difference might indicate bias in one 
or both tests. As another instance, a higher
than expected association between reading 
and mathematics achievement test scores 
among students who might well have limit¬ 
ed English proficiency could trigger an 
investigation to determine whether language 
proficiency was influencing some examinees' 
mathematics scores. Patterns of score
averages or other distributional summaries might
also point to potential sources of test bias. If 
males outperformed females on one measure 
of academic performance and, in the same 
population, females outperformed males on 
another, it would follow that the two meas¬ 
ures could not both be linearly related to the 
identical underlying construct. Note,
however, that if the tested populations differed, if
the content domains sampled differed, or if
the constructs tested otherwise differed due 
to varying motivational contexts or other
effects, two reliable tests, each valid for its 
intended purpose, might show such a pat¬ 
tern. Association need not imply any direct 
or causal linkage, and alternative explana¬ 
tions for patterns of association should 
usually be considered. In some cases, a test- 
criterion correlation may arise because the 
test and criterion both depend on the same 
construct-irrelevant ability. If identifiable 
subgroups differ with respect to that extra¬ 
neous ability, then bias may result. 

Fairness in Selection and 
Prediction 

When tests are used for selection and predic¬ 
tion, evidence of bias or lack of bias is gener¬ 
ally sought in the relationships between test 
and criterion scores for the respective groups. 
Under one broadly accepted definition, no 
bias exists if the regression equations relating 
the test and the criterion are indistinguishable
for the groups in question. (Some formula¬ 
tions may hold that not only regression slopes 
and intercepts but also standard errors of 
estimate must be equal.) If test-criterion
relationships differ, different decision rules 
may be followed depending on the group 
to which the person belongs. 

If fitting a common prediction equation 
for all groups combined suggests that the cri¬ 
terion performance of persons in any one 
group is systematically overpredicted or 
underpredicted, and if bias in the criterion 
measure has been set aside as a possible 
explanation, one possibility is to generate a 
separate prediction formula for each group. 
Another possibility is to seek predictor vari¬ 
ables that may be used in lieu of or in addi¬ 
tion to the initial predictor score to reduce 
differential prediction without reducing over¬ 
all predictive accuracy. If separate regression 
equations are employed, the effect of their 
use on the distribution of predicted criterion
scores for the different groups should be
examined. Note that in the United States, the 
use of different selection rules for identifiable 
subgroups of examinees is legally proscribed 
in some contexts. There may, however, be 
legal requirements to consider alternative 
selection procedures in some such situations. 

There is often tension between the per¬ 
spective that equates fairness with lack of 
bias, in the technical sense, and the
perspective that focuses on testing outcomes. A test
that is valid for its intended purpose might be 
considered fair if a given test score predicts 
the same performance level for members of 
all groups. It might nonetheless be regarded 
by some as unfair, however, if average test 
scores differ across groups. This is because a 
given selection score and criterion threshold 
will often result in proportionately more false 
negative decisions in groups with lower mean 
test scores. In other words, a lower-scoring 
group will usually have a higher proportion 
of examinees who are rejected on the basis 
of their test scores even though they would 
have performed successfully if they had been 
selected. This seeming paradox is a statistical 
consequence of the imperfect correlation
between test and criterion. It does not occur 
because of any other property of the test and 
has no direct relationship to group demo¬ 
graphics. It is a purely statistical phenomenon 
that occurs as a function of lower test scores, 
regardless of group membership. For exam¬ 
ple, it usually occurs when the top and bot¬ 
tom test score halves of the majority group 
are compared. The fairness of a test or 
another predictor should be evaluated rela¬ 
tive to that of nontest alternatives that 
might be used instead. 
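
The statistical point above can be illustrated with a small simulation.
This is a sketch under stated assumptions, not an analysis of any real
test: both groups share one prediction equation (so the test is unbiased
in the regression sense), the test-criterion correlation is set to about
0.5, the groups differ only in their mean test score, and every name and
numeric value is hypothetical. A false negative here is an examinee who
falls below the test cutoff yet would have met the criterion threshold.

```python
import random


def false_negative_share(test_mean, n=20000, cutoff=0.0,
                         threshold=0.0, seed=0):
    """Share of a group rejected by the test despite criterion success."""
    rng = random.Random(seed)
    false_negatives = 0
    for _ in range(n):
        test_score = rng.gauss(test_mean, 1.0)
        # Same equation for every examinee, regardless of group.
        criterion = 0.5 * test_score + rng.gauss(0.0, 0.866)
        if test_score < cutoff and criterion >= threshold:
            false_negatives += 1
    return false_negatives / n


# Two hypothetical groups that differ only in mean test score.
share_higher_scoring = false_negative_share(test_mean=0.5, seed=1)
share_lower_scoring = false_negative_share(test_mean=-0.5, seed=2)
print(share_higher_scoring, share_lower_scoring)
```

Under these assumptions the lower-scoring group shows the larger share
of false negatives even though the prediction equation is identical for
both groups, which is the pattern the paragraph describes.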

Group Outcome Differences Due to Choice of 
Predictors 

Success in virtually all real-world 
endeavors requires multiple skills and abili¬ 
ties, which may interact in complex ways. 
Testing programs typically address only a
subset of these. Some skills and abilities are
excluded because they are assessed in other 
components of the selection process (e.g., 
completion of course work or an interview); 
others may be excluded because reliable and 
valid measurement is economically, logisti- 
cally, or administratively infeasible. Success 
in college, for example, requires persever¬ 
ance, motivation, good study habits, and a 
host of other factors in addition to verbal
and quantitative reasoning ability. Even if 
each of the criteria employed in a selection 
process is demonstrably valid and appropri¬ 
ate for that purpose, issues of fairness may 
arise in the choice of which factors are 
measured. If identifiable groups differ in 
their average levels of measured versus 
unmeasured job-relevant characteristics, 
then fairness becomes a concern at the 
group level as well as the individual level. 

Can Consensus Be Achieved? 

It is unlikely that consensus in society at 
large or within the measurement communi¬ 
ty is imminent on all matters of fairness in 
the use of tests. As noted earlier, fairness is 
defined in a variety of ways and is not 
exclusively addressed in technical terms; it is 
subject to different definitions and interpre¬ 
tations in different social and political cir¬ 
cumstances. According to one view, the 
conscientious application of an unbiased 
test in any given situation is fair, regardless 
of the consequences for individuals or 
groups. Others would argue that fairness 
requires more than satisfying certain techni¬ 
cal requirements. It bears repeating that 
while the Standards will provide more spe¬ 
cific guidance on matters of technical ade¬ 
quacy, matters of values and public policy 
are crucial to responsible test use. 


Standard 7.1 

When credible research reports that test 
scores differ in meaning across examinee 
subgroups for the type of test in question, 
then to the extent feasible, the same forms 
of validity evidence collected for the exam¬ 
inee population as a whole should also be 
collected for each relevant subgroup. 
Subgroups may be found to differ with 
respect to appropriateness of test content, 
internal structure of test responses, the 
relation of test scores to other variables, or 
the response processes employed by indi¬ 
vidual examinees. Any such findings should 
receive due consideration in the interpreta¬ 
tion and use of scores as well as in subse¬ 
quent test revisions. 

Comment: Scores differ in meaning across 
subgroups when the same score produces 
systematically different inferences about 
examinees who are members of different 
subgroups. In those circumstances where 
credible research reports differences in score 
meaning for particular subgroups for the type 
of test in question, this standard calls for 
separate, parallel analyses of data for members 
of those subgroups, sample sizes permitting. 
Relevant examinee subgroups may be defined 
by race or ethnicity, culture, language, gender, 
disability, age, socioeconomic status, or other 
classifications. Not all forms of evidence can 
be examined separately for members of all 
such groups. The validity argument may rely 
on existing research literature, for example, 
and such literature may not be available for 
some populations. For some kinds of evi¬ 
dence, some separate subgroup analyses may 
not be feasible due to the limited number
of cases available. Data may sometimes be
accumulated so that these analyses can be
performed after the test has been in use for a 
period of time. This standard is not satisfied 
by assuring that such groups are represented 
within larger, pooled samples, although this
may also be important. In giving “due
consideration in the interpretation and use of
scores,” pursuant to this standard, test users
should be mindful of legal restrictions that 
may prohibit or limit within-group scoring 
and other practices. 

Standard 7.2 

When credible research reports differences 
in the effects of construct-irrelevant variance 
across subgroups of test takers on perform¬ 
ance on some part of the test, the test 
should be used, if at all, only for those
subgroups for which evidence indicates 
that valid inferences can be drawn from 
test scores. 

Comment: An obvious reason why a test 
may not measure the same constructs across 
subgroups is that different components come 
into play from one subgroup to another. 
Alternatively, an irrelevant component may 
have a more significant effect on the perform¬ 
ance of examinees in one subgroup than in 
another. Such intrusive elements are rarely 
entirely absent for any subgroup but are
seldom present to any great extent. The decision
whether or not to use a test with any given 
examinee subgroup necessarily involves a 
careful analysis of the validity evidence for 
different subgroups, as called for in Standard 
7.1, and the exercise of thoughtful profession¬ 
al judgment regarding the significance of the 
irrelevant components. 

A conclusion that a test is not appro¬ 
priate for a particular subgroup requires 
an alternative course of action. This may 
involve a search for a test that can be used 
for all groups or, in circumstances where it 
is feasible to use different construct-equiva¬ 
lent tests for different groups, for an alter¬ 
native test for use in the subgroup for 
which the intended construct is not well 
measured by the current test. In some cases
multiple tests may be used in combination,
and a composite that permits valid
inferences across subgroups may be identified.
In some circumstances, such as employment 
testing, there may be legal or other con¬ 
straints on the use of different tests for 
different subgroups. 

It is acknowledged that there are 
occasions where examinees may request or 
demand to take a version of the test other 
than that deemed most appropriate by the 
developer or user. An individual with a 
disability may decline an alternate form 
and request the standard form. Acceding 
to this request, after ensuring that the 
examinee is fully informed about the test 
and how it will be used, is not a violation 
of this standard. 

Standard 7.3 

When credible research reports that differ¬ 
ential item functioning exists across age, 
gender, racial/ethnic, cultural, disability, 
and/or linguistic groups in the population 
of test takers in the content domain meas¬ 
ured by the test, test developers should 
conduct appropriate studies when feasible. 
Such research should seek to detect and 
eliminate aspects of test design, content, 
and format that might bias test scores for 
particular groups. 

Comment: Differential item functioning 
exists when examinees of equal ability 
differ, on average, according to their group 
membership in their responses to a particu¬ 
lar item. In some domains, existing research 
may indicate that differential item
functioning occurs infrequently and does not
replicate across samples. In others, research
evidence may indicate that differential item 
functioning occurs reliably at meaningful 
above-chance levels for some particular 
groups; it is to such circumstances that the 
standard applies. Although it may not be 
possible prior to first release of a test to
study the question of differential item
functioning for some such groups,
continued operational use of a test may afford
opportunities to check for differential
item functioning.

Standard 7.4 

Test developers should strive to identify 
and eliminate language, symbols, words, 
phrases, and content that are generally 
regarded as offensive by members of racial, 
ethnic, gender, or other groups, except 
when judged to be necessary for adequate 
representation of the domain. 

Comment: Two issues are involved. The first 
deals with the inadvertent use of language 
that, unknown to the test developer, has a
different meaning or connotation in one
subgroup than in others. Test publishers
often conduct sensitivity reviews of all test
material to detect and remove sensitive 
material from the test. The second deals 
with settings in which sensitive material is 
essential for validity. For example, history 
tests may appropriately include material on 
slavery or Nazis. Tests on subjects from the 
life sciences may appropriately include 
material on evolution. A test of under¬ 
standing of an organization’s sexual harass¬ 
ment policy may require employees to 
evaluate examples of potentially offensive 
behavior. 

Standard 7.5 

In testing applications involving individu¬ 
alized interpretations of test scores other 
than selection, a test taker's score should 
not be accepted as a reflection of standing
on the characteristic being assessed with¬ 
out consideration of alternate explanations 
for the test taker’s performance on that test 
at that time. 


Comment: Many test manuals point out 
variables that should be considered in inter¬ 
preting test scores, such as clinically relevant 
history, school record, vocational status, and 
test-taker motivation. Influences associated 
with variables such as socioeconomic status, 
ethnicity, gender, cultural background, lan¬ 
guage, or age may also be relevant. In addi¬ 
tion, medication, visual impairments, or 
other disabilities may affect a test taker’s 
performance on, for example, a paper-and- 
pencil test of mathematics. 

Standard 7.6 

When empirical studies of differential pre¬ 
diction of a criterion for members of dif¬ 
ferent subgroups are conducted, they 
should include regression equations (or 
an appropriate equivalent) computed sepa¬ 
rately for each group or treatment under 
consideration or an analysis in which the 
group or treatment variables are entered 
as moderator variables. 

Comment: Correlation coefficients provide 
inadequate evidence for or against a differ¬ 
ential prediction hypothesis if groups or 
treatments are found not to be approxi¬ 
mately equal with respect to both test 
and criterion means and variances. 
Considerations of both regression slopes 
and intercepts are needed. For example,
despite equal correlations across groups, 
differences in intercepts may be found. 
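
The point of this comment can be shown with a small worked example. The
sketch below is illustrative only: the scores are invented, the helper
name fit_line is hypothetical, and ordinary least squares is assumed as
the regression method. The two groups are built to have the same
test-criterion correlation and slope, yet their intercepts differ, so
comparing correlations alone would miss the difference in prediction.

```python
def fit_line(x, y):
    """Least-squares slope, intercept, and correlation of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / (sxx * syy) ** 0.5
    return slope, intercept, correlation


# Invented data: same test scores in both groups; group B's criterion
# scores are shifted upward by a constant 2 points.
test_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
criterion_a = [2.0, 3.0, 5.0, 4.0, 6.0]
criterion_b = [value + 2.0 for value in criterion_a]

slope_a, intercept_a, r_a = fit_line(test_scores, criterion_a)
slope_b, intercept_b, r_b = fit_line(test_scores, criterion_b)
print(slope_a, intercept_a, r_a)
print(slope_b, intercept_b, r_b)
```

A single equation fitted to both groups combined would in this case
underpredict criterion performance for group B and overpredict it for
group A at every test score.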

Standard 7.7 

In testing applications where the level of 
linguistic or reading ability is not part of 
the construct of interest, the linguistic or 
reading demands of the test should be kept 
to the minimum necessary for the valid
assessment of the intended construct.

Comment: When the intent is to assess ability
in mathematics or mechanical comprehen¬ 
sion, for example, the test should not con¬ 
tain unusual words or complicated syntactic 
conventions unrelated to the mathematical 
or mechanical skill being assessed. 

Standard 7.8 

When scores are disaggregated and pub¬ 
licly reported for groups identified by 
characteristics such as gender, ethnicity, 
age, language proficiency, or disability, 
cautionary statements should be included 
whenever credible research reports that test 
scores may not have comparable meaning 
across these different groups. 

Comment: Comparisons across groups are 
only meaningful if scores have comparable 
meaning across groups. The standard is 
intended as applicable to settings where 
scores are implicitly or explicitly presented as
comparable in score meaning across groups. 

Standard 7.9 

When tests or assessments are proposed 
for use as instruments of social, education¬ 
al, or public policy, the test developers or 
users proposing the test should fully and 
accurately inform policymakers of the 
characteristics of the tests as well as any 
relevant and credible information that may 
be available concerning the likely conse¬ 
quences of test use. 

Standard 7.10 

When the use of a test results in outcomes 
that affect the life chances or educational 
opportunities of examinees, evidence of 
mean test score differences between rele¬ 
vant subgroups of examinees should, 
where feasible, be examined for subgroups 
for which credible research reports mean 
differences for similar tests. Where mean
differences are found, an investigation
should be undertaken to determine that 
such differences are not attributable to a 
source of construct underrepresentation 
or construct-irrelevant variance. While 
initially the responsibility of the test 
developer, the test user bears responsibility 
for uses with groups other than those 
specified by the developer. 

Comment: Examples of such test uses 
include situations in which a test plays a 
dominant role in a decision to grant or 
withhold a high school diploma or to pro¬ 
mote a student or retain a student in grade. 
Such an investigation might include a 
review of the cumulative research literature 
or local studies, as appropriate. In some 
domains, such as cognitive ability testing 
in employment, a substantial relevant 
research base may preclude the need for 
local studies. In educational settings, as dis¬ 
cussed in chapter 13, potential differences 
in opportunity to learn may be relevant as 
a possible source of mean differences. 

Standard 7.11 

When a construct can be measured in dif¬ 
ferent ways that are approximately equal 
in their degree of construct representation 
and freedom from construct-irrelevant 
variance, evidence of mean score differ¬ 
ences across relevant subgroups of exam¬ 
inees should be considered in deciding 
which test to use.

Comment: Mean score differences, while 
important, are but one factor influencing 
the choice between one test and another. 
Cost, testing time, test security, and logistic 
issues (e.g., an application where very large
numbers of examinees must be screened in 
a very short time) are among the issues also 
entering into the professional judgment 
about test use.

Standard 7.12

The testing or assessment process should 
be carried out so that test takers receive 
comparable and equitable treatment
during all phases of the testing or assessment
process. 

Comment: For example, should a person 
administering a test or interpreting test 
results recognize a personal bias for or 
against an examinee, or for or against any
subgroup of which the examinee is a mem¬ 
ber, the person could take a variety of steps 
ranging from seeking a review of test inter¬ 
pretations from a colleague to withdrawal 
from the testing process.


8. THE RIGHTS AND RESPONSIBILITIES
OF TEST TAKERS 


Background 

This chapter addresses fairness issues unique 
to the interests of the individual test taker. 
Fair treatment of test takers is not only a
matter of equity, but also promotes the validity
and reliability of the inferences made from 
the test performance. The standards presented 
in this chapter reflect widely accepted
principles in the field of measurement. The
standards address the responsibilities of test takers
with regard to test security, their access to test 
results, and their rights when irregularities in 
their testing are claimed. Other issues of fair¬ 
ness are treated in other chapters: general 
principles in chapter 7; the testing of linguis¬ 
tic minorities in chapter 9; the testing of per¬ 
sons with disabilities in chapter 10. General 
considerations concerning reports of test 
results are covered in chapter 5. 

Test takers have the right to be assessed 
with tests that meet current professional
standards, including standards of technical
quality, fairness, administration, and reporting of
results. Fair and equitable treatment of test 
takers involves providing, in advance of test¬ 
ing, information about the nature of the test, 
the intended use of test scores, and the confi¬ 
dentiality of the results. Test takers, or their 
legal representatives when appropriate, need 
enough information about the test and the 
intended use of test results to reach a compe¬ 
tent decision about participating in testing. 
In some instances, formal informed consent 
for testing is required by law or by other stan¬ 
dards of professional practice, such as those 
governing research on human subjects. The 
greater the consequences to the test taker, 
the greater the importance of ensuring that 
the test taker is fully informed about the test 
and voluntarily consents to participate, 
except when testing without consent is
permitted by law. If a test is optional, the test
taker has the right to know the consequences
of taking or not taking the test. The test
taker has the right to acceptable
opportunities for asking questions or expressing
concerns, and may expect timely responses to
legitimate questions. 

Where consistent with the purposes 
and nature of the assessment, general
information is usually provided about the test's
content and purposes. Some programs, in
the interests of fairness, provide all test
takers with helpful materials, such as study
guides, sample questions, or complete
sample tests, when such information does not
jeopardize the validity of the results from
future test administrations. Advice may also
be provided about test-taking strategies, 
including time management, and the advis¬ 
ability of omitting an item response, when 
it is permitted. Information is made known 
about the availability of special accommoda¬ 
tions for those who need them. The policy 
on retesting may be stated, in case the test 
taker feels that the present performance 
does not appropriately reflect his/her best
performance.

As participants in the assessment, test 
takers have responsibilities as well as rights. 
Their responsibilities include preparing them¬ 
selves for the test, following the directions of 
the test administrator, representing
themselves honestly on the test, and informing
appropriate persons if they believe the test 
results do not adequately reflect them. In 
group testing situations, test takers are expect¬ 
ed not to interfere with the performance of 
other test takers. 

Test validity rests on the assumption 
that a test taker has earned fairly a particu¬ 
lar score or pass/fail decision. Any form of 
cheating, or other behavior that reduces the 
fairness and validity of a test, is
irresponsible, is unfair to other test takers, and may
lead to sanctions. It is unfair for a test taker
to use aids that are prohibited. It is unfair 
for a test taker to arrange for someone else 
to take the test in his/her place. The test taker 
is obligated to respect the copyrights of the 
test publisher or sponsor on all test materials. 
This means that the test taker will not
reproduce the items without authorization nor
disseminate, in any form, material that is
clearly analogous to the reproduction of the
items. Test takers, as well as test administra¬ 
tors, have the responsibility not to compro¬ 
mise security by divulging any details of the 
test items to others nor may they request 
such details from others. Failure to honor 
these responsibilities may compromise the 
validity of test score interpretations for 
themselves and for others. 

Sometimes, testing programs use special 
scores, statistical indicators, and other 
indirect information about irregularities in 
testing to help ensure that the test scores
are obtained fairly. Unusual patterns of 
responses, large changes in test scores upon 
retesting, speed of responding, and similar 
indicators may trigger careful scrutiny of 
certain testing protocols. The details of 
these procedures are generally kept secure 
to avoid compromising their use. However, 
test takers can be made aware that in special 
circumstances, such as response or test score 
anomalies, their test responses may get 
special scrutiny. If evidence of impropriety
or fraud so warrants, the test taker’s score 
may be canceled, or other action taken. 

Because these Standards are directed 
to test providers, and not to test takers, 
standards about test-taker responsibilities 
are phrased in terms of providing
information to test takers about their rights and
responsibilities. Providing this information 
is the joint responsibility of the test devel¬ 
oper, the test administrator, the test proctor, 
if any, and the test user and may be appor¬ 
tioned according to particular circumstances. 


Standard 8.1 

Any information about test content and 
purposes that is available to any test taker 
prior to testing should be available to all 
test takers. Important information should 
be available free of charge and in accessi¬ 
ble formats. 

Comment: The intent of this standard is 
equal treatment for all. Important informa¬ 
tion would include that necessary for test¬ 
ing, such as when and where the test is 
given, what material should be brought,
the purpose of the test, and so forth. More 
detailed information, such as practice mate¬ 
rials, is sometimes offered for a fee. Such 
offerings should be made to all test takers. 

Standard 8.2 

Where appropriate, test takers should be 
provided, in advance, as much information 
about the test, the testing process, the 
intended test use, test scoring criteria, 
testing policy, and confidentiality protec¬ 
tion as is consistent with obtaining valid 
responses. 

Comment: Where appropriate, test takers 
should be informed, possibly by a test bul¬ 
letin or similar procedure, about test con¬ 
tent, including subject area, topics covered, 
and item formats. They should be informed 
about the advisability of omitting responses. 
They should be aware of any imposed time 
limits, so that they can manage their time 
appropriately. General advice should be 
given about test-taking strategy. In computer 
administrations, they should be told 
about any provisions for review of items 
they have previously answered or omitted. 
Test takers should understand the intended 
use of test scores and the confidentiality of 
test results. They should be advised whether 
they will have access to their results. They 
should be informed about the policy
concerning taking the test again and about
the possibility that some test protocols 
may receive special scrutiny for security 
reasons. Test takers should be informed 
about the consequences of misconduct or 
improper behavior, such as cheating, that 
could result in their being prohibited from
completing the test, receiving test scores, 
or other sanctions. 

Standard 8.3 

When the test taker is offered a choice of
test format, information about the charac¬ 
teristics of each format should be provided. 

Comment: Test takers sometimes have to 
choose between a paper-and-pencil admi¬ 
nistration and a computer-administered 
test, which may be adaptive. Some tests 
are offered in several different languages. 
Sometimes an alternative assessment is 
offered in lieu of the ordinary test. Test 
takers need to know the characteristics of 
each alternative so that they can make an 
informed choice. 

Standard 8.4 

Informed consent should be obtained from
test takers, or their legal representatives
when appropriate, before testing is done 
except (a) when testing without consent 
is mandated by law or governmental regu¬ 
lation, (b) when testing is conducted as 
a regular part of school activities, or (c) 
when consent is clearly implied. 

Comment: Informed consent implies that
the test takers or representatives are made 
aware, in language that they can under¬ 
stand, of the reasons for testing, the type 
of tests to be used, the intended use, and 
the range of material consequences of 
the intended use. If written, video, or 
audio records are made of the testing ses¬
sion, or other records are kept, test takers


are entitled to know what testing informa¬ 
tion will be released and to whom. Consent 
is not required when testing is legally man¬ 
dated, such as a court-ordered psychological 
assessment, but there may be legal require¬ 
ments for providing information. When 
testing is required for employment or for 
educational admissions, applicants, by 
applying, have implicitly given consent to 
the testing. Nevertheless, test takers and/ 
or their legal representatives should be 
given appropriate information about a test 
when it is in their interest to be informed. 
Young test takers should receive an explana¬ 
tion of the reasons for testing. Even a child 
as young as two or three, as well as older 
test takers of limited cognitive ability, can 
understand a simple explanation as to why 
they are being tested (such as, “I’m going
to ask you to try to do some things so
that I can see what you know how to do
and what things you could use some more
help with”).

Standard 8.5 

Test results identified by the names of 
individual test takers, or by other perso¬ 
nally identifying information, should be 
released only to persons with a legitimate, 
professional interest in the test taker or 
who are covered by the informed consent 
of the test taker or a legal representative, 
unless otherwise required by law. 

Comment: Scores of individuals identified 
by name, or by some other means by which 
a person can be readily identified, such as 
social security number, should be kept con¬ 
fidential. In some situations, information 
may be provided on a confidential basis to
other practitioners with a legitimate interest 
in the particular case, consistent with legal 
and ethical considerations. Information 
may be provided to researchers if a test 
taker’s anonymity is maintained and the
intended use is consistent with accepted
research practice and is not inconsistent 
with the conditions of the test taker’s
informed consent. 

Standard 8.6 

Test data maintained in data files should 
be adequately protected from improper 
disclosure. Use of facsimile transmission, 
computer networks, data banks, and other 
electronic data processing or transmittal 
systems should be restricted to situa¬ 
tions in which confidentiality can be 
reasonably assured. 

Comment: When facsimile or computer 
communication is used to transmit a test 
protocol to another site for scoring, or if 
scores are similarly transmitted, special pro¬ 
visions should be made to keep the infor¬
mation confidential. See Standard 5.13.

Standard 8.7

Test takers should be made aware that 
having someone else take the test for 
them, disclosing confidential test materi¬ 
al, or any other form of cheating is inap¬ 
propriate and that such behavior may 
result in sanctions. 

Comment: Although the standards cannot 
regulate the behavior of test takers, test 
takers should be made aware of their per¬ 
sonal and legal responsibilities. Arranging 
for someone else to impersonate the nom¬ 
inal test taker constitutes fraud. Disclosure 
of confidential testing material for the pur¬
pose of giving other test takers pre-knowl¬
edge is unfair and may constitute copyright
infringement. In licensure and certification
tests, such actions may compromise public 
health and safety. The validity of test score 
interpretations is compromised by inappro¬ 
priate test disclosure. 


Standard 8.8 

When score reporting includes assigning 
individuals to categories, the categories 
should be chosen carefully and described 
precisely. The least stigmatizing labels, 
consistent with accurate representation, 
should always be assigned. 

Comment: When labels are associated with 
test results, care should be taken to be pre¬ 
cise in the meanings associated with the 
labels and to avoid unnecessarily stigmatiz¬
ing consequences associated with a label.
For example, in an assessment designed to 
aid in determining whether an individual is 
competent to stand trial, the label “incom¬
petent” is appropriate for individuals who
perform poorly on the assessment. However, 
in a test of basic literacy skills, it is more 
appropriate to use a label such as “not pro¬ 
ficient” rather than “incompetent,” because 
the latter term has a more global and 
derogatory meaning. 

Standard 8.9 

When test scores are used to make deci¬ 
sions about a test taker or to make recom¬ 
mendations to a test taker or a third party, 
the test taker or the legal representative is 
entitled to obtain a copy of any report of 
test scores or test interpretation, unless 
that right has been waived or is prohibited
by law or court order. 

Comment: In some cases a test taker may be 
adequately informed when the test report is
given to an appropriate third party (treating 
psychologist or psychiatrist) who can inter¬ 
pret the findings to the test taker. In profes¬ 
sional applications of individualized testing, 
when the test taker is given a copy of the
test report, the examiner or a knowledgeable 
third party should be available to interpret 
it, even if it is clearly written, as the test
taker may misunderstand or raise questions
not specifically answered in the report. In 
employment testing situations, where test 
results are used solely for the purpose of 
aiding selection decisions, waivers of access 
are often a condition of employment, 
although access to test information may 
often be appropriately required in other 
circumstances. 

Standard 8.10 

In educational testing programs and in 
licensing and certification applications, 
when an individual score report is expected 
to be delayed beyond a brief investigative 
period, because of possible irregularities 
such as suspected misconduct, the test 
taker should be notified, the reason given, 
and reasonable efforts made to expedite 
review and to protect the interests of the 
test taker. The test taker should be noti¬ 
fied of the disposition, when the investi¬ 
gation is closed. 

Standard 8.11 

In educational testing programs and in 
licensing and certification applications, 
when it is deemed necessary to cancel or 
withhold a test taker’s score because of pos¬
sible testing irregularities, including sus¬
pected misconduct, the type of evidence
and procedures to be used to investigate 
the irregularity should be explained to all 
test takers whose scores are directly affected
by the decision. Test takers should be given 
a timely opportunity to provide evidence 
that the score should not be canceled or 
withheld. Evidence considered in deciding 
upon the final action should be made avail¬ 
able to the test taker on request. 

Comment: Any form of cheating or behavior 
that reduces the validity and fairness of test 
results should be investigated promptly, and 


appropriate action taken. Withholding or 
canceling a test score may arise because of 
suspected misconduct by the test taker, or 
because of some anomaly involving others, 
such as theft, or administrative mishap. An 
avenue of appeal should be available and 
made known to candidates whose scores 
may be amended or withheld. Some testing 
organizations offer the option of a prompt 
and free retest or arbitration of disputes. 

Standard 8.12 

In educational testing programs and in 
licensing and certification applications, 
when testing irregularities are suspected, 
reasonably available information bearing 
directly on the assessment should be con¬ 
sidered, consistent with the need to pro¬ 
tect the privacy of test takers. 

Comment: Unless allegations of misconduct 
are made by associates of the test taker, the 
information to be collected would ordinari¬ 
ly be limited to that obtainable without 
invading the privacy of the test taker or 
his/her associates. 

Standard 8.13 

In educational testing programs and in 
licensing and certification applications, 
test takers are entitled to fair considera¬ 
tion and reasonable process, as appropriate 
to the particular circumstances, in resolv¬ 
ing disputes about testing. Test takers are 
entitled to be informed of any available 
means of recourse. 

Comment: When a test taker’s score may
be questioned and may be invalidated, or 
when a test taker seeks a review or revision 
of his/her score or some other aspect of the 
testing, scoring, or reporting process, the 
test taker is entitled to some orderly process 
for effective input into or review of the
decision making of the test administrator or
test user. Depending upon the magnitude of 
the consequences associated with the test, 
this can range from an internal review of all 
relevant data by a test administrator, to an 
informal conversation with an examinee, to 
a full administrative hearing. The greater 
the consequences, the greater the extent of 
procedural protections that should be made 
available. Test takers should also be made 
aware of procedures for recourse, fees, 
expected time for resolution, and any possi¬ 
ble consequences for the test taker. Some 
testing programs advise that the test taker 
may be represented by an attorney, although 
possibly at the test taker’s expense.

UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF COLUMBIA 


AMERICAN EDUCATIONAL RESEARCH ASSOCIATION, INC., AMERICAN
PSYCHOLOGICAL ASSOCIATION, INC., and NATIONAL COUNCIL ON
MEASUREMENT IN EDUCATION, INC.,

Plaintiffs,

v.

PUBLIC.RESOURCE.ORG, INC.,

Defendant.

Civil Action No. 1:14-cv-00857-TSC-DAR

DECLARATION OF KURT F. GEISINGER IN SUPPORT OF PLAINTIFFS’ MOTION FOR
SUMMARY JUDGMENT AND ENTRY OF A PERMANENT INJUNCTION

I, KURT F. GEISINGER, declare: 

1. I am currently Director of the Buros Center for Testing and W. C. Meierhenry
Distinguished University Professor at the University of Nebraska-Lincoln. I submit this 
Declaration in support of the motion of the American Educational Research Association, Inc. 
(“AERA”), the American Psychological Association, Inc. (“APA”), and the National Council on 
Measurement in Education, Inc. (“NCME”) (collectively, “Plaintiffs” or “Sponsoring 
Organizations”) for summary judgment and the entry of a permanent injunction. 

2. My curriculum vitae is attached to this Declaration as Exhibit 1. 

3. I received my doctoral degree in Educational Psychology in 1977 from the 
Pennsylvania State University, after previously receiving my master’s degree in Psychology at
the University of Georgia and my bachelor’s degree from Davidson College (with honors). I 
also studied German, Psychology and other topics as an undergraduate at the Phillips Universitat 
in Marburg, Germany and at Harvard University when I attended the Institute for Educational 
Management in 1995.

4. From 2001 to 2006, I served as the Vice President of Academic Affairs and
Professor of Psychology at the University of St. Thomas in Houston, Texas, where I was 
responsible for four academic schools, approximately 200 faculty members, and over 4,000 
students. From 1997 to 2001, I served as Academic Vice President and Professor of Psychology
at Le Moyne College. From 1992 to 1997, I served as Dean of the College of Arts and Sciences
and Professor of Psychology at the State University of New York at Oswego. And from 1977 to
1992, I served as a Professor of Psychology at Fordham University in New York City, where I
served as department chair for the Department of Psychology and director of the Doctoral 
program in Psychometrics. 

5. Over the past forty years, I have researched, studied, and taught psychometrics 
(psychometrics is the quantitative study of tests and measures in terms of the value, usefulness, 
and interpretation of the results of such measures). I also am a fellow, diplomate, and member of 
numerous professional societies involving educational and psychological testing, such as the 
APA (fellow), the American Association for Assessment Psychology (diplomate), the AERA 
(fellow), and the NCME, as well as other professional associations. I have represented the APA 
by serving on and chairing the Joint Committee on Testing Practices (which is separate from the 
APA committee responsible for the 1999 Standards for Educational and Psychological Testing) 
and have served on the APA’s Committee on Psychological Testing and Assessment. In 2010, I
was elected to serve two terms (2006-2008 and 2009-2011) as the representative on the Council 
of Representatives for the APA’s Division of Evaluation, Measurement and Statistics. My 
second term was cut short by one year when I was elected to serve as a member-at-large on the 
APA’s Board of Directors in 2010, a position I held for a three-year term (2011-2013).

6. I have authored numerous publications about psychological and educational
testing. I have worked at the Educational Testing Service (“ETS”), chaired its Technical 
Advisory Committee for the Graduate Record Examination (“GRE”), served on the Board of 
Directors for the GRE (a Board that I also chaired), and have been a member of the College
Board (formerly known as the College Entrance Examination Board), for which I served (2000-2002)
on its SAT Committee. I recently concluded a four-year term (2011-2014) on the
Advisory Research Committee for the College Board, serving the last two years as its chair. I 
currently serve on the Technical Advisory Committee for the Educational Records Bureau 1 and 
on Saudi Arabia’s International Advisory Board for its National Center for Assessment and 
Evaluation. 

7. In 2010, I was elected to the Council (i.e., Board of Directors) for the 
International Test Commission—the primary international testing body. In 2012, I was also 
elected as its Treasurer and to serve on its Executive Council. I am the only American who 
serves on its Executive Council. 

8. I was asked to review and share my comments on two chapters of the 1999 Standards
for Educational and Psychological Testing, published jointly by the AERA, the APA, and the 
NCME (the “1999 Standards”). The Standards 2 embody the professionally accepted practices 
for testing and measurement. One of the chapters I reviewed was based upon the testing of 
individuals with disabilities, an area in which I have engaged in research and have served as an 
expert witness in federal courts as well as state courts in New York, New Jersey, and California. 


1 The Educational Records Bureau specializes in the development and use of tests and testing
products for private and independent educational institutions at the p-12 levels. 

2 I use the term “Standards” to refer to the Standards for Educational and Psychological Testing 
as a whole, not a specific version of the Standards, i.e., 1999 or 2014.

The other chapter related to the rights and responsibilities of test takers. See Exh. 1. I note that
the Standards were revised again in 2014. 

9. In addition to my 130 plus journal articles and book chapters, I have written, 
edited, or co-edited approximately 15 books and monographs. The vast majority of these 
publications deal with testing and measurement issues. For example, I have edited two books on 
the psychological testing of Hispanics and another I co-edited related to fairness in testing. I 
have also co-edited several books of reviews of published tests and measures. I was also Editor- 
in-Chief for the three-volume Handbook of Testing and Assessment in Psychology (published by 
the APA in 2013) and have been editor of the journal Applied Measurement in Education for the 
past 9 plus years. Taylor & Francis, in conjunction with the Buros Center for Testing, publishes
this journal. 

10. I also co-chaired a sub-committee of the APA’s Joint Committee on Testing 
Practices and the overall committee itself that developed a document on the rights and 
responsibilities of test takers (1993-2001). This document has been endorsed by a number of 
professional associations related to proper test use, including the APA, the National Association 
of School Psychologists, the American Counseling Association, and the NCME. While chairing 
the Joint Committee on Testing Practices, the committee developed a book entitled Assessing 
Individuals with Disabilities, in which I wrote a chapter. I also served on a task force charged to 
illuminate issues related to the testing of individuals with disabilities as well as ethnic minorities. 
The task force wrote and edited a book entitled Test Interpretation and Diversity: Achieving 
Equity in Assessment, which was published by the APA’s publication unit in 1997. I had three 
chapters in that volume.

11. I additionally served on an APA task force (2007-2010) that considered the
assessment and intervention of individuals with disabilities. The results of our work, “Guidelines
for the Assessment of and Intervention with Individuals with Disabilities,” were published in the
American Psychologist, the premier publication of the APA (Geisinger et al., 2012) and endorsed 
as the policy of the APA by its governance. A reference for the American Psychologist article 
may be found on my curriculum vitae, which is attached as Exhibit 1. 

12. In the past two years (2014-2015), I have served on two task forces related to the 
use of measures in clinical psychology. One of these has written a policy, recently accepted by 
the APA’s Board of Directors, that differentiates the use of tests and other measures for
screening and for assessment, two highly related types of testing that differ in specificity and
focus. Tests are usually standardized measures that are given to a number of people for a
specific purpose. A bar examination would be an example of a test. Measures are other 
variables yielding typically quantitative values that are used to evaluate a person and include 
tests. A bathroom scale results in a measure (weight), but would not normally be considered as a 
test. 

13. During 2013-2014, I served on a committee of the Institute of Medicine (a 
component of the National Academy of Sciences) that evaluated the use of psychological and 
clinical neuropsychological measures by the Social Security Administration in determining 
disability status. The final report, entitled Psychological Testing in the Service of Disability 
Determination, has been published by the National Academy of Sciences and is also available 
from the Institute of Medicine’s website.

14. For approximately four years (2008-2012), I jointly represented three professional 
associations (the AERA, the APA, and the NCME) in developing the International Standards
Organization’s (“ISO”) first standard on psychological testing. The result of the work of the
committee that engaged in this activity was ISO Standard 10667. The standard is divided into
two parts. The first establishes requirements and guidance for a client working with a service 
provider to carry out the assessment of an individual, a group, or an organization for work- 
related purposes. ISO 10667-1:2011 enables the client to base its decisions on sound assessment 
results. ISO 10667-1:2011 also specifies the responsibilities of a service provider in terms of the 
assessment methods and procedures that can be carried out for various work-related purposes 
made by or affecting individuals, groups or organizations. 

15. I also built or helped to build a number of testing measures. Specifically, I served 
as the primary consultant on a number of civil service examinations given in New York City for 
police officers, sergeants, lieutenants, and captains, fire fighters, fire lieutenants, fire captains, 
sanitation supervisors, and a variety of other civil service occupations over a period of at least a 
decade ending in 1992. I sometimes defended these measures in court. I also represented the 
Public Service Alliance of Canada against the Public Service of Canada in two cases related to 
their national testing efforts and assisted Disability Rights Advocates with regard to several 
testing disputes concerning individuals with disabilities. See Exh. 1. 

16. In recent years, my primary efforts have been to assure testing fairness for those 
with disabilities, language minorities, and ethnic minorities. 

17. I first learned about the Standards for Educational and Psychological Testing 
while I was in my first or second year of graduate school. They are widely discussed in classes 
on testing and testing practice and treated with great respect. Some graduate programs and 
courses require that students purchase the Standards as part of their coursework and education.
In teaching graduate classes on topics related to testing and associated with the Standards, I often
refer to them, building the thoughts and approaches described in the Standards, as well as
specific standards, into my lectures and classes. I expect students to purchase and read the 
Standards in a number of the classes I taught. When writing chapters and articles on such topics 
as test validity, test reliability, and test fairness—all topics I have discussed in writing—I 
frequently refer to the Standards to check my use of language, my interpretations, and to check 
that I am not omitting a topic of importance relevant to the specific publication. Also, when 
building tests, such as the Police and Fire Department Civil Service tests I helped construct for 
the City of New York, or when serving on technical advisory committees for the well-known 
SAT and GRE committees and boards, I refer to the Standards frequently. Usually, in meetings, I
attempt to express what I believe to be best practices and then “back up” my beliefs with
quotes from the specific and relevant standards. Perhaps my greatest use of the Standards has 
occurred in my legal defense of specific tests or in my critique of particular uses of some tests, 
both of which I have engaged in during my career as an expert witness. 

18. The ultimate advantages of the Standards in my opinion are that they are written 
and edited by first-rate professionals covering a number of the representative fields in which 
testing and assessment are primarily employed, they are thoroughly and publicly vetted by other 
professionals, and they are openly discussed during the revision process at many professional 
conferences. The resultant document becomes a living document of best practices. That the 
members of the committee drafting the Standards are generally extremely highly respected 
professionals in the field of testing and testing practice also provides the Standards great 
credibility. Given my experience over the last 10 years as Director of the Buros Center for 
Testing, thought of by many as the Consumer Reports of the testing industry, and my service as 
the co-editor of the Mental Measurements Yearbooks, where commercially available tests are
evaluated, I can state categorically that the Standards serve as the primary basis for all test
evaluations. The other editors of these Yearbooks and I refer to the Standards with great 
frequency to determine and assure ourselves that the comments made by reviewers are consistent 
with the Standards and that the reviews themselves are based upon principles supported by, and 
coherent with, the Standards. The Standards originally were created as principles and guidelines 
- a set of best practices to improve professional practice in testing and assessment across 
multiple settings, including education and various areas of psychology. The Standards can and 
should be used as a recommended course of action in the sound and ethical development and use 
of tests, and also to evaluate the quality of tests and testing practices. Additionally, an essential 
component of responsible professional practice is maintaining technical competence. Many 
professional associations also have developed standards and principles of technical practice in 
assessment. The Sponsoring Organizations’ Standards have been and still are used for this 
purpose. 

19. The Standards, however, are not simply intended for members of the Sponsoring 
Organizations: AERA, APA, and NCME. The intended audience of the Standards is broad and
cuts across audiences with varying backgrounds and different training. For example, the 
Standards also are intended to guide test developers, sponsors, publishers, and users by providing 
criteria for the evaluation of tests, testing practices, and the effects of test use. Test user-oriented 
standards refer to those standards that help test users decide how to choose certain tests, interpret 
scores, or make decisions based on test results. Test users include clinical or industrial 
psychologists, research directors, school psychologists, counselors, employment supervisors, 
teachers, and various administrators who select or interpret tests for their organizations. There is 
no mechanism, however, to enforce compliance with the Standards on the part of the test
developer or test user. The Standards, moreover, do not attempt to provide psychometric
answers to policy or legal questions. They do not themselves set requirements, but serve to 
distribute best practices and procedures. 

20. The Standards apply broadly to a wide range of standardized instruments and 
procedures that sample an individual’s behavior, including tests, assessments, inventories, scales, 
and other testing vehicles. The Standards apply equally to standardized multiple-choice tests, 
performance assessments (including tests comprised of only open-ended essays), and hands-on 
assessments or simulations. The main exceptions are that the Standards do not apply to 
unstandardized questionnaires (e.g., unstructured behavioral checklists or observational forms),
teacher-made tests, and subjective decision processes (e.g., a teacher’s evaluation of students’ 
classroom participation over the course of a semester). 

21. The Standards have been used to develop testing guidelines for such activities as 
college admissions, personnel selection, test translations, test user qualifications, and computer- 
based testing. The Standards also have been widely cited to address technical, professional, and 
operational norms for all forms of assessments that are professionally developed and used in a 
variety of settings. The Standards additionally provide a valuable public service to state and 
federal governments as they voluntarily choose to use them. For instance, each testing company, 
when submitting proposals for testing administration, instead of relying on a patchwork of local, 
or even individual and proprietary, testing design and implementation criteria, may rely instead 
on the Sponsoring Organizations’ Standards to afford the best guidance for testing and 
assessment practices. 

22. The Sponsoring Organizations do not keep any of the revenues generated from the 
sales of the Standards. Rather, the income from these sales is used by the Sponsoring
Organizations to offset their development and production costs and to generate funds for
subsequent revisions. This strategy allows the Sponsoring Organizations to develop up-to-date, 
high quality Standards that otherwise would not be developed due to the time and effort that goes 
into producing them. 

23. Without the sales revenue from prior Standards versions (because - if Public 
Resource succeeds in this litigation - this publication will be made freely available online), it is 
extremely unlikely that future updates to the Standards will be undertaken. This well-informed 
opinion is made because NCME is too small an organization to financially support periodic 
updates of the Standards, AERA does not have the budget for it, and an insufficient number of 
psychometricians are members of APA for it to justify the ongoing expenditures. Charging extra 
membership fees to fund ongoing updates to the Standards would never happen, because the 
governing bodies of AERA, APA and NCME would not vote for it. If these Sponsoring 
Organizations ceased updating the Standards, it is unlikely that other organizations would step in 
and continue the effort. Moreover, there are no other organizations with the expertise in their 
memberships to populate such a committee or task force. 

24. There simply is no way for Plaintiffs to calculate with any degree of certainty the 
number of university/college professors, students, testing companies and others who would have 
purchased Plaintiffs’ Standards but for their wholesale posting on Defendant’s 
https://law.resource.org website and the Internet Archive http://archive.org website. 

25. In Fiscal Year (“FY”) 2011 to FY 2012, as compared to FY 2011, the Sponsoring 
Organizations experienced a 34% drop in sales of the 1999 Standards. In FY 2013, sales of the 
1999 Standards remained at their low level from the prior fiscal year (See F. Levine Declaration, 
¶ 18, Exh. OOO). For a publication with the longevity of the 1999 Standards, one otherwise 
would expect to see a gradual decline in sales year-over-year; not the precipitous drop in sales 
experienced by the 1999 Standards in 2012 and 2013 - even considering that updated Standards 
were published in 2014. It is also clear that this drop did not occur due to the expected 
publication of the 2014 Standards, because they were actually due to be published more than a 
year earlier. Thus, one would have expected such a drop to occur perhaps in 2010 or 2011. 

26. Past harm from Public Resource’s infringing activities includes misuse of 
Plaintiffs’ intellectual property without permission, lost sales that cannot be totally accounted for 
- due to potentially infinite Internet distribution, especially by psychometrics students, and lack 
of funding that otherwise would have been available for the update of the Sponsoring 
Organizations’ Standards from the 1999 to the 2014 versions. 

27. Should Public Resource’s infringement be allowed to continue, the harm to the 
Sponsoring Organizations, and public at large who rely on the preparation and administration of 
valid, fair and reliable tests, includes: (i) uncontrolled publication of the 1999 Standards without 
any notice that those guidelines have been replaced by the 2014 Standards; (ii) future 
unquantifiable loss of revenue from sales of authorized copies of the 1999 Standards (with 
proper notice that they are no longer the current version) and the 2014 Standards; and (iii) lack of 
funding for future revisions of the 2014 Standards and beyond. 

28. The harm caused to the public by publication of out-of-date Standards (not 
labeled as such) will be significant, because the testing and assessment fields are constantly 
changing, given updates in testing technology and ever-evolving collective thought on the 
validity, reliability and fairness of tests. Members of the public who would be harmed by 
discontinued updates of the Standards include psychometrics professors, students and 
professionals, as well as test developers, administrators and takers. 
I DECLARE, under the penalty of perjury, that the foregoing is true and correct. 


Dated: December 9, 2015 




Kurt F. Geisinger 
EXHIBIT 1 


Updated 12/1/2014 

CURRICULUM VITAE 
Kurt F. Geisinger, Ph.D. 


Current Position : Director, Buros Center for Testing 

W. C. Meierhenry Distinguished University Professor 
The University of Nebraska-Lincoln 


Office Address and Telephone 

21 Teachers College Hall 

Buros Center for Testing 

The University of Nebraska-Lincoln 

Lincoln, NE 68588 

Telephone: 402/472-3280 

FAX: 402/472-6207 

E-Mail: kgeisinger2@unl.edu 


Home Address and Telephone 

6300 Rainier Court 
Lincoln, NE 68510-5050 
Telephone: 402/327-0205 
E-Mail: Kurtgeis1@aol.com 


EDUCATION 


A.B. with Honors, Davidson College 
M.S., The University of Georgia 
Ph.D., The Pennsylvania State University 


ADMINISTRATIVE RESPONSIBILITIES 

2006 to the present Director, Buros Center for Testing, University of Nebraska-Lincoln. Direct the 
Buros Institute for Mental Measurements. Supervise director of the Buros Institute for 
Assessment Consultation and Outreach. Provide consultation on assessment issues to 
clients. Editorial and executive leadership to the Mental Measurements Yearbooks, 
Tests in Print, and the journal, Applied Measurement in Education, which I edit. 
Serve as Interim Director of the Buros Institute for Assessment Consultation and 
Outreach effective 9/15/2007. Tenured, chaired professor. 

2001 to 2006 Vice President for Academic Affairs, University of St. Thomas, Houston 

Responsible for programs encompassing over 125 full-time faculty members and 
approximately 5,000 students. Lead deans of Schools of Arts & Science, Business, 
Education, Theology, and Graduate Program in Liberal Arts as well as libraries and 
advisement. Responsible for personnel, student, and budget issues. Lead the college in 
the absence of the President. Tenured full professor. 

1997 to 2001 Academic Vice President, Le Moyne College. 

Responsible for college of over 120 full-time faculty members and approximately 3,000 
students. Lead deans, academic departments, library, admissions, financial aid, registrar, 
academic support center, and continuing education office. Specific personal 
responsibility for running graduate programs in Business (MBA) and Education (M.Ed.). 
Responsible for personnel, student, enrollment and budget issues. Lead the college in the 
absence of the President. Tenured full professor. 
1992 to 1997 Dean of Arts and Sciences, State University of New York, College at Oswego. 

Lead 19 academic departments, Biological Field Station, Environmental Research Center, 
Art Gallery, approximately 20 librarians, Office of Learning Support Services, consisting 
of over 230 faculty members and 4900 students. Responsible for personnel, student and 
budget issues. Tenured full professor. 

1985 to 1991 Chairperson of the Department of Psychology, Fordham University. Administered 
department of 18 full-time faculty members with at least 5 additional FTEs and 
approximately 180 full- and part-time graduate students, 5 doctoral program, and 70 
undergraduate majors. Coordinated extensive faculty evaluation and hiring efforts. 

1979 to 1985 Director of Doctoral Program in Psychometrics, Department of Psychology, 
Fordham University. Administered doctoral program for approximately 25 graduate 
students. Wrote descriptions of the program, developed a formal curriculum, 
coordinated curriculum and course offerings, developed Ph.D. Comprehensive 
Examinations, advised program students. Coordinated hiring of program faculty. 


AWARDS, RECOGNITIONS, AND HONORARY OFFICES 

American Board of Assessment Psychology, Diplomate (1994) 

Recipient of the Jacob Cohen Award for Distinguished Teaching and Mentoring, American Psychological 
Association, 2008 

Recipient of the President’s Award for Scholarly and Creative Activity, SUNY-Oswego, 1995 
Recipient of the 1997 Leo D. Doherty Award by the Northeastern Educational Research Association for leadership 
in educational research 

Recipient of the 2002 Thomas J. Donlon Award by the Northeastern Educational Research Association for 
distinguished mentoring 

Biographee in Who’s Who in the East (23rd ed., 24th ed., 25th ed., 26th ed., 27th ed., 28th ed.), Who’s Who in America 
(48th ed., 49th ed., 50th ed., 54th ed., 55th ed., 56th ed., 57th ed.), Who’s Who in American Education (4th ed., 
5th ed.), Who’s Who in the World (20th ed.), Who’s Who in Medicine and Healthcare (3rd ed.), Who’s Who 
in Emerging Leaders in America (4th ed.), Who’s Who in Science and Technology 
Psi Chi (National Psychology Honor Society) 

Sigma Xi (National Scientific Research Society) 

Northeastern Educational Research Association, President for term 1987-1988 
(President-elect, 1986-87); (Past President, 1988-89) 

Northeastern Educational Research Association, Program Committee, 1978 - present (Co-Chair, 1985) 

Northeastern Educational Research Association, Member, Board of Directors for term 1984-87 
Phi Kappa Phi (National Academic Honor Society) 

President, Fordham University Chapter, Phi Kappa Phi, 1985-86 

President-Elect and Acting President, Fordham University Chapter, Phi Kappa Phi 1984-85 
Treasurer (1983-86) and Secretary (1983-84), Fordham University Chapter, Phi Kappa Phi 
Selected as an Outstanding Young Man of America, 1982 
FACULTY EXPERIENCE 

Faculty Work and Employment 


2006-present 

W. C. Meierhenry Distinguished University Professor, Department of Educational 
Psychology, University of Nebraska-Lincoln. Teach one advanced doctoral seminar 
per semester and run Buros Center for Testing and the Buros Institute for Mental 
Measurements. Beginning in October 2007, directing the Buros Institute for Assessment 
Consultation and Outreach on an interim basis. Serve on departmental and university 
committees. 

1989 to 1992 

Professor of Psychology, Fordham University, (Tenured). Chairperson, Graduate 
School of Arts and Sciences’ Long Term Planning Committee (1989-91); Member, 
University Research Council (1985-1991); Member, Graduate Studies Council (1984-1991); 
Member, University Tenure Review Committee (1990-1993), Chair (1991-92). 
Served on various Departmental and University committees. Served as survey and 
grading consultant to the College Dean. Graduate courses taught included Statistics, 
Psychological Testing, Test Construction, Psychometrics, Survey and Interview 
Methodology, Differential Psychology, Personnel Selection, Program Evaluation, the 
Teaching of Psychology. Undergraduate courses taught: Introductory Psychology, 
Statistics, Research Design, Psychological Testing, and Seminar on Personnel Decisions 
for Police. Supervised 16 doctoral dissertations (with one currently in progress). Served 
on dissertation committees. Supervised Masters’ research projects (have directed twelve 
studies). Advised graduate and undergraduate students. Coordinated faculty evaluation 
via student rating. 

1981 to 1989 

Associate Professor of Psychology, Fordham University, (Tenured 6/83). Essentially 
the same duties and responsibilities as above. On sabbatical, Spring Semester, 
1986, at the Research Division, Educational Testing Service, Princeton, NJ. 

1977 to 1981 

Assistant Professor of Psychology, Fordham University. Essentially the same job 
activities as Professor above. 

1975 to 1976 

Instructor, Departments of Educational Psychology and Psychology, The 
Pennsylvania State University. Taught graduate courses in Educational and 
Psychological Testing. 


Externally Funded Research Activity 


2006-2013 

As Director of Buros Center for Testing at the University of Nebraska, I have brought in 
approximately $350,000/year in contract research. One example is listed below. 

2007-2008 

Project Director. Department of Education, State of Florida ($200,000). Provide consultation to 
the State regarding its statewide testing program, its equating. Discuss implications of testing 
program with legislators and senators as well as Department of Education commissioners and staff 
members. 

1993-94 

Institutional Planning Team Member, American Council on Education/National Endowment 
for the Humanities, Spreading the Word, a program to institute a Modern Languages across the 
curriculum project at the State University of New York at Oswego. 

1988-92 

Faculty Participant and Project Evaluator, Grant ($250,000) from the Fund for the 
Improvement of Post-Secondary Education (FIPSE) to develop a Master of Arts in Liberal 
Studies at Fordham University. 
1990 to 1991 Project Director, Grant ($9,840) from the American Psychological Association Science 
Directorate with matching grant ($6,560) from the Fordham University Sesquicentennial 
Celebration to host a conference on the Psychological Testing of Hispanics, February 9, 1990, 
New York City. 

1979 to 1980 Project Director, Grant ($50,000) from Harcourt Brace Jovanovich, Inc. to study the effects of 
test use in the school with special emphasis on their use with minority children. Anne Anastasi, 
Principal Investigator. 


1978 to 1979 Research Associate, Grant ($75,000) from Harcourt Brace Jovanovich, Inc. Same study as 
above. Anne Anastasi, Principal Investigator and Project Director. 


SERVICE ON NATIONAL COMMITTEES AND BOARDS 

American Educational Research Association, American Psychological Association & National Council on 

Measurement in Education representative to the International Standards Commission’s Committee on 
International Testing Standards, 2007-2010 

American Psychological Association, Board of Directors, 2011-2013 

American Psychological Association, Council of Representatives member for Division 5 (2006-2010) 

American Psychological Association, Coalition of Academic, Scientific, and Applied Psychology, President-elect 
(2008), President (2009), Past President (2010) 

American Psychological Association, Committee on Psychological Tests and Assessments, 1998 to 2000 

American Psychological Association, Committee on Psychological Tests and Assessments Task Force on Test 
Interpretation for Diverse Groups, 1993 to 1997 

American Psychological Association, Division 5 (Evaluation, Measurement & Statistics), Member, Membership 
Committee (1988-1992), Chairperson (1991-92); Member, ad hoc Committee for the Disabled; Member, 
Public Policy Committee (1996-99), Chairperson (1997-98), Executive Committee Member (2006-2008) 

American Psychological Association, Division 15 (Educational Psychology), Member, Early Contributions 
Committee (1990-1993) 

American Psychological Association, Office of Program Accreditation and Consultation, Site Visitor, 1987 to 
2010 


The College Board, Research Advisory Committee, 2010-2012 (Chair 2012) 

The College Board, Middle States Regional Council, 1998-2000 

The College Board, Member, Editorial Board, The College Board Review. 2000-2006 

The College Board, Scholastic Assessment Test Committee, 2000-2003 


Council for the Accreditation of Educator Professionals, Commission on Standards and Performance Reporting, 
Commissioner, 2012-2013 

Council for the Accreditation of Educator Professionals, Commission on Institutional Briefs, 

Commissioner, 2013-present 

Council for the Accreditation of Educator Professionals, Research Committee, Member, 2013-present 

Council of Graduate Schools, Committee on Masters’ Education at Predominantly Masters Institutions, 2002- 2006 

Council of Independent Colleges, Committee to Provide a Workshop for New Chief Academic Officers, 2002-2004, 
Chair (2003-2004) 

Educational Testing Service, Member, Panel convened to review test security procedures and processes (1994) 
Graduate Record Examination, Technical Advisory Committee, Member, 1995-2002; Chair 2000-2003 
Graduate Record Examination, Board, ex officio, 2000-2003 
Graduate Record Examination, Board, 2003-2007 

Graduate Record Examination, Chair-elect, 2004-2005, Chair, 2005-2006, Past Chair, 2006-2007 
Graduate Record Examination, Research Committee, 2000-2007 


International Association of Applied Psychology, Division 2 (Psychological Assessment and Evaluation), 
President-elect (2014-2018), President (2018-2022), Past President (2022-2026). 

International Test Commission, Council Member (2010-2012), Treasurer (2012-2014) 

Joint Committee on Testing Practices, American Psychological Association delegate and Co-Chair (1992-96) 
Joint Committee on Testing Practices, Member and Co-Chair, Understanding Testing Working Group, 1990-94 
(American Psychological Association delegate to the committee) 

Joint Committee on Testing Practices, Member, Testing Individuals with Disabilities Working Group, 1996-2001 
Joint Committee on Testing Practices, Member and Co-Chair, Test Taker Rights Working Group, (1993-2001) 

National Council on Measurement in Education, Professional Training and Development Committee (1990-92), 
Chairperson (1991-92) 

National Council on Measurement in Education, Member, Ad Hoc Committee to Develop a Code of Ethical 
Standards Committee (1992-94) 

National Council on Measurement in Education, Program Committee Co-Chair (1994) 


EDITORIAL WORK 


2011-2012 Special Issue Editor, International Journal of Testing (Volume 12, Issue 2) 

2006 to the present Editor, Applied Measurement in Education 

2001 to the present Consulting Editor, Practical Assessment, Research, and Evaluation 

2000 to 2006 Consulting Editor, College Board Review 

2000 to the present Consulting Editor, International Journal of Testing 

1992 to 2000 Member, Editorial Board, Psychological Assessment 

1992 to the present Member, Editorial Board, Educational Research Quarterly 

1997 to the present Member, Editorial Board, ITEMS 

1991 to 1997 Member, Board of Cooperating Editors, Educational and Psychological Measurement 

1992 to 1995 Member, Advisory Board, Educational Measurement: Issues and Practice 

1988 to 1991 Co-Editor, The NERA Researcher (the quarterly newsletter of the Northeastern 
Educational Research Association) 

1978 to 1983 Consulting Editor, Improving College and University Teaching 

1979 to 1984 Consulting Editor, Journal of Educational Research 

1988 Consultant, Psychology of Work Behavior (4th ed.), by F. J. Landy. Homewood, IL: 
Dorsey Press. 

1988 Consultant, Psychology (2nd ed.), by L. T. Benjamin, J. R. Hopkins, and J. R. Nation, 
New York: Macmillan. 

1985 Consultant, Psychology: The Science of People (2nd ed.), by F. J. Landy. Englewood 
Cliffs, NJ: Prentice-Hall, Inc. 

1983 Consultant and Critical Reader, Psychology: The Science of People, by F. J. Landy. 
Englewood Cliffs, NJ: Prentice-Hall, 1984. 

1982 Editorial Consultant, Applied Psychology in Occupational Organizations, by L. R. 

Aiken. Reading, MA.: Addison-Wesley. 

1982 Editorial Consultant, The Handbook of Questionnaire Construction, by J. R. Jacoby. 

New York: Academic Press, 1984. 

1982 Editorial Consultant, Principles and Techniques of Questionnaire Design, by F. J. 

Kviz and W. L. Kreitman. New York: Academic Press, 1984. 

1980 Reviewer, Applied Psychometrics, by R. L. Thorndike. Boston: Houghton Mifflin 

Company, 1982. 

1980 Critical Reader, Psychology, by C. Wortman and E. Loftus, New York: Random House, 

1981. 

1980 Editorial Consultant, Psychology at Work: An Introduction to Industrial Psychology, 

by J. P. Houston and L. M. Berry. Boston: Addison-Wesley. 

1978 Critical Reader, Psychology Today: An Introduction, by J. Braun and D. E. Linder. 

New York: Random House, 1979. 

MEMBERSHIPS IN PROFESSIONAL ASSOCIATIONS 

American Association for Higher Education 
American Educational Research Association 

American Psychological Association (Divisions 2 [Teaching of Psychology], 5 [Measurement, Evaluation, 
Statistics and Assessment], 14 [Industrial/Organizational Psychology] and 15 [Educational 
Psychology]); Fellow in Divisions 5, 15, and 52 
American Psychological Society, Charter Fellow 
College Board 
Council of Graduate Schools 
Eastern Psychological Association 
National Council on Measurement in Education 
Northeastern Educational Research Association 
Northern Rocky Mountain Educational Research Association 
Society of Psychologists in Management 


SELECTED CONSULTING 


2008 Measured Progress. Serve on a panel to consider the testing of students with 

disabilities. 

2008 The College Board. Considered validation report of the new SAT. 

2007 W.W. Norton (Publishing house). Served on a panel to make recommendations on 

improving undergraduate assessment. 

1995-2003 Educational Testing Service. Served on and chaired (2000-2003) the Technical 

Advisory Committee for the GRE. Served on other paid committees related to the SAT 
and the GRE. 

2001, 2002, 2005-6 Disability Rights Advocates. Testified before a panel formed to recommend the 
flagging policies on the SAT and other College Board examinations to the College Board 
and against the Association of American Medical Colleges in a case relating to the 
provision of accommodations to individuals with disabilities. 

2002-2003 U.S. Department of Justice. Consider the validity of the Law School Admission Test 
for the Department of Justice and wrote reports regarding the same. 

1999 Steptoe & Johnson, LLP. Provided a deposition and testimony for the United States 
District Court of Eastern Pennsylvania regarding the use of flagging for a professional 
school licensure test. 

1986 to 2001 Fox and Fox, Counselors at Law, Newark, NJ. Serve as an expert witness and 
consultant in cases related to the title of Engineer, Department of Transportation, and 
Fire Captain. Consulted on other cases related to police and fire matters. 

1989 to 1995 Cornell University, New York State School of Industrial and Labor Relations. 
(New York City). Delivered lecture entitled “E.E.O. Selection” several times a year as 
part of their “Human Resources Programs: Professional Workshop Series in New York 
City.” 

1995 to 1998 American Board of Physical Therapists, Alexandria, VA. Provide guidance with 
regard to the passing score for some certification examinations. 

Prior to 1995 


Prior to 1992, I was engaged in considerable consultation with a variety of states, municipalities, and unions. This 
consultation generally concerned test development and often involved my leading major test construction projects in 
civil service, industrial, and educational testing. I gained considerable understanding of the functioning of 
organizations in so doing. 

For approximately 10 years from the early 1980s through the 1990s, I was the primary consultant to New York City 
Department of Personnel building and defending in court Police Officer and Fire Fighter Examinations. This 
involved examinations for the entry-level positions as well as all levels of promotion including for police service of 
Sergeant, Lieutenant, and Captain for the New York City Police Department, the Transit Police and the Housing 
Police. It also included working with all ranks in the Fire Department, through Chief of the Department, as well as 
positions in Sanitation (entry-level and promotional), Social Services, Parks and Recreation, and Health Services. I 
built the examination used for the hiring of Test and Measurement Specialists in New York City. 

I served as an expert witness (working with the New York City Department of Law) in a number of cases 
defending New York City personnel examinations and in one case, against the examination for Parks and Recreation 
Worker. I served as an expert witness and consultant to the Public Service Alliance of Canada, the union of 
federal employees in Canada, in cases against three tests, a Canadian Intelligence Test, a Canadian Customs Officer 
Supervisor Test, and an Office Manager Test. 

Together with Dr. Richard R. Reilly of Assessment Alternatives, I performed job analyses for the New Jersey Civil 
Service Commission of a number of police positions. 

While in graduate school at the Pennsylvania State University, for almost two years I directed a court-ordered study 
that ultimately brought women onto the police force in the City of Philadelphia. Prior to and during this study, I 
went through various aspects of police training, worked closely with uniformed police representatives of the 
department, developed rating scales for the evaluation of police officers. As stated above, this was a full-time 
position for approximately 1.5 years with Bartell Associates, of State College, Pennsylvania. 
RESEARCH ACTIVITY AND PUBLICATIONS 

Articles for Journals 

Dahlman, K. A. & Geisinger, K. F. (2015, in press). The Prevalence of Measurement in Undergraduate 
Psychology Curricula across the United States. Scholarship of Teaching and Learning in Psychology. 

Lee, H. & Geisinger, K. F. (In press.) The Matching Criterion Purification for DIF Analyses in a Large-scale 
Assessment. Educational and Psychological Measurement. 

Brabeck, M. M., Dwyer, C. A., Geisinger, K. F., Marx, R. W., Noell, G. W., Pianta, R. C., Subotnick, R. F., & 
Worrell, F. C. (2015, in press). Assessing the assessments of teacher preparation. Theory into Practice, 
DOI: 10.1080/00405841.2015.1036667 

Lee, H. & Geisinger, K.F. (2014). The Effect of Propensity Scores on DIF Analysis: Inference on the Potential 
(With A. N. Wilson and J. J. Naumann.) Educational and Psychological Measurement . 1980, 
40, 413-417. 
Who are giving all those A’s? An examination of high grading college faculty members. The Journal of 
Teacher Education. 1980, 31 (March-April), 11-15. 
Psychology. Thousand Oaks, CA: Sage 
Geisinger, K. F. (2012.) Norm- and criterion-referenced testing. In H. Cooper (Ed.), Handbook of research 
Washington, DC: APA Books. 
(4th ed.). (Vol. 1; p. 246.) New York: Wiley. 

Psychometrics: Norms, Reliability, Validity, and Item Analysis. (2010). In I. Weiner & W. E. Craighead (Eds.) 
Thousand Oaks, CA: Sage. 
& D. K. Smith (Eds.), Assessing Individuals with Disabilities in Educational, Employment, and 
Counseling Settings , (pp. 33-42). Washington, DC: American Psychological Association. 

Standards and standardization. (With J. F. Carlson.) (2002). In J. N. Butcher (Ed.), Clinical Personality 
Testing accommodations for the new millennium: Computer-administered testing in a changing society. In Niyogi, 
S. (Ed.). New Directions in Assessment for Higher Education: Fairness. Access. Multiculturalism & 

Equity (FAME) . The Graduate Record Examination FAME Report Series, No. 2, (1998), pp. 12-20. 
375-386. 
682-685. 
Bogardus Social Distance Scale. In R. J. Corsini (Ed.), Wiley encyclopedia of psychology . New York: 

Wiley, 1984. Volume I, p. 160. 

Psychometrics. In R.J. Corsini (Ed.). Wiley encyclopedia of psychology . New York: Wiley, 1984. Volume 
3, pp. 163-165. 

Questionnaires. In R. J. Corsini (Ed.), Wiley encyclopedia of psychology . New York: Wiley, 1984. Volume 3. pp. 
199-200. 

Test standardization. In R.J. Corsini (Ed.), Wiley encyclopedia of psychology . New York: Wiley, 1984. 
Volume 3. p. 414. 

Factor analytic studies of the McGill Pain Questionnaire. (With E. J. Prieto). In R. Melzack (Ed.), Pain 
measurement and assessment . New York: Raven Press, 1983. pp. 63-70. 

Marking systems. In H. E. Mitzel (Ed.), Encyclopedia of educational research (5th Ed.) New York: The 
Free Press, 1982. Vol. 3: 1139-1149. 

Grading attitudes and practices among college faculty members. (With W. Rabinowitz.) In H. Dahle, A. 

Lysne, & P. Rand (Eds.), A Spotlight on educational problems . Oslo, Norway: Universitets 
Forlaget, 1979. pp. 145-172. (Distributed in the United States by the Columbia University 
Press, Irvington, NY). 

Developing an operational model for assessing experiential learning. (With W. W. Willingham.) In W. W. 
Willingham & H. S. Nesbitt (Eds.), Implementing a program for assessing experiential learning . 
Princeton, NJ: Cooperative Assessment of Experiential Learning (Educational Testing Service), 

1976. Chapter 1; pp. 1-15. 

Overview of CAEL field research. (With W. W. Willingham.) In W. W. Willingham & Associates, The 

CAEL validation report , Princeton, NJ: Cooperative Assessment of Experiential Learning (Educational 
Testing Service), 1976. Chapter III, pp. 1-35. 

Data analysis. (With R. R. Reilly & W. W. Willingham.) In W. W. Willingham & Associates, The CAEL 
validation report , Princeton, NJ: Cooperative Assessment of Experiential Learning (Educational 
Testing Service), 1976. Appendix 5, pp. 5-1 - 5-42. 
Books and Monographs 

Pardes, H., Barsky, A., Daly, M., Geisinger, K. F., Gerber, N., Jette, A., Koop, J., Suzuki, L. A., Twamley, E., Ubel, 
P., & Wall, J. (2015, in press). Psychological testing in the service of disability determination. 

Washington, DC: National Academies Press (Institute of Medicine report/to be published by the Institute 
of Medicine). 

Geisinger, K. F. (2015, in press). (Ed.) Psychological testing of Hispanics: Clinical, cultural, and intellectual 
assessment. Washington, DC: American Psychological Association. 

Carlson, J. F., Geisinger, K. F. & Jonson, J. (Eds.) (2014) The Nineteenth Mental Measurements Yearbook. 

Lincoln, NE: Buros Institute of Mental Measurements. 

Worrell, F. C., Brabeck, M. M., Dwyer, C. A., Geisinger, K. F., Marx, R. W., Noell, G. H., & Pianta, R.C. (2014). 
Assessing and evaluating teacher preparation programs: An APA Task Force Report. Washington, DC: 
American Psychological Association. 

Schlueter, J. E., Carlson, J. F., Geisinger, K. F., and Murphy, L.L. (Eds.) (2013). Pruebas Publicadas en Español: 
An index of Spanish tests in print. Lincoln, NE: Buros Center for Testing. 

Geisinger, K. F. (Ed.). (2013). Handbook of testing and assessment in psychology (3 volumes). Washington, DC: 
American Psychological Association. 

Murphy, L. M., Geisinger, K. F., Carlson, J. F., & Spies, R. S. (2011). Tests in print VIII. Lincoln, NE: Buros 
Institute of Mental Measurements. 

Bovaird, J., Geisinger, K. F., & Buckendahl, C. B. (Eds.), (2011.) High Stakes Testing: Science and Practice in K- 
12 Settings. Washington, DC: American Psychological Association. 

Spies, R. A., Carlson, J. F., & Geisinger, K. F. (Eds.) (2010). The Eighteenth Mental Measurements Yearbook. 
Lincoln, NE: Buros Institute of Mental Measurements. 

Geisinger, K. F., Spies, R. A., Carlson, J.F., & Plake, B. S. (2007). The Seventeenth Mental Measurements 
Yearbook. Lincoln, NE: Buros Institute for Mental Measurements. 

Sandoval, J., Frisby, C., Geisinger, K.F., Scheuneman, J., & Ramos-Grenier, J. M. (Eds.). (1998). Test interpretation 
and diversity: Achieving equity in psychological assessment. Washington, DC: American Psychological 
Association. 

Lloyd, B., Crocker, L., Geisinger, K.F., & Webb, M. (1994). Report of the panel convened to review test security 
procedures at the Educational Testing Service in February, 1994. Princeton, NJ: Educational Testing 
Service. 

Geisinger, K. F. (1992). Psychological Testing of Hispanics (Ed.), Washington, DC: American Psychological 
Association, 1992. 

Geisinger, K. F. & Anastasi, A. Instructor’s manual to accompany Psychological testing, A. Anastasi, Sixth 
Edition. New York: Macmillan, 1988. 

Geisinger, K. F. & Anastasi, A. Instructor’s manual to accompany Psychological testing, A. Anastasi, Fifth 
Edition. (With A. Anastasi and S. Urbina.) New York: Macmillan, 1982. 

Anastasi, A. & Geisinger, K. F. Use of tests with schoolchildren. JSAS Catalogue of Selected Documents in
Psychology, 1981, 11, ERIC Document No. ED 194-635.

Videotapes

The ABC’s of School Testing. (Project Co-director, with J. J. Fremer and J. Wall). Co-author of the
Leader’s Guide for the videotape. Both are available from the National Council on Measurement in
Education, Washington, DC. (1994).


Book Reviews 

Review of The Conditions of Admission: Access, Equity, and the Social Contract of Public Universities by John 
Aubrey Douglas. Educational Horizons. 2008, 86, 182-185. 

One is the loneliest number: Two is not as bad as one (in some instances). (With S. L. Davis.) Review of Dyadic 
Data Analysis by D. A. Kenny, D. A. Kashy, W. L. Cook. Journal of Clinical and Social Psychology. 
2008, 27, 311-313.

Review of Making sense of college grades. (By O. Milton, H. R. Pollio, and J. A. Eison.) Journal of
Educational Measurement, 1988, 25, 167-170.

Review of Support for teaching at major universities. (Edited by S. C. Ericksen with J. A. Cook.) Improving
College and University Teaching, 1980, 28, 41.


Paper Presentations 

Geisinger, K. F. (2015). The ITC Guidelines on Quality Control in Scoring, Test Analysis and Reporting of Test
Scores. In A. Odendall (Chr.), The International Test Commission’s Guidelines for Good Testing Practice. 
Symposium presented at the annual meeting of the Society for Industrial and Organizational Psychology, 
Philadelphia, PA. April. 

Geisinger, K. F. (2015). Using ITC Guidelines. In D. Bartram (Chr.), Executive Board Special Session: Improving 
International Testing Practice with the International Test Commission. Symposium presented at the annual 
meeting of the Society for Industrial and Organizational Psychology, Philadelphia, PA. April. 

Geisinger, K. F. (2015). Global transportability of measures. In Y. Yang & T. L. Hayes (Co-chairs),
Transportability: Boundaries, Challenges, and Standards. Symposium presented at the annual meeting of 
the Society for Industrial and Organizational Psychology, Philadelphia, PA. April. 

Geisinger, K. F. (2015). Publishing in Applied Measurement in Education. Roundtable presented at the annual 
meeting of the American Educational Research Association, Chicago, IL. April. 

Geisinger, K. F. (2015). General Overview of Standards for Technical Quality. In F. Worrell (Chr.), Higher
Education Assessment: Evaluating and Assessing Teacher Preparation Programs. Symposium presented at 
the annual meeting of the American Educational Research Association, Chicago, IL, April. 

Geisinger, K. F. (2015). Test reviewing at the Buros Center for Testing. In T. Patelis (Chr.), Various Efforts to
Evaluate the Quality of Assessment Programs. Symposium presented at the annual meeting of the National 
Council on Measurement in Education. Chicago, IL, April. 

Geisinger, K. F. (2015). The assessment of 21st Century skills: A global perspective. Invited address at the
University of Luxembourg, Luxembourg. March 4, 2015.

Geisinger, K. F. (2014). Keynote interview. In Kristen Huff (Chair), Invited keynote presentation at the annual
meeting of the Northeastern Educational Research Association, October, Trumbull, CT. 


Case 1:14-cv-00857-TSC Document 60-88 Filed 12/21/15 Page 31 of 48
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 422 of 517 

Kurt F. Geisinger, Ph.D. Page 19 


Geisinger, K. F. (2014). The Buros Approach to Ensuring Quality. In T. Patelis (Chr.), Ensuring the quality of
assessments. Symposium presented at the annual meeting of the Northeastern Educational Research 
Association, October, Trumbull, CT. 

Geisinger, K. F. (2014). How do we ensure fairness? In T. Patelis (Chr.), Fairness issues in assessment and
accountability. Symposium presented at the annual meeting of the Northeastern Educational Research
Association, October, Trumbull, CT. 

Geisinger, K. F. (2014). Evaluating tests: A continuing effort for psychologists. Invited divisional keynote 
presentation, International Congress of Applied Psychology, Paris, France, July, 2014. 

Geisinger, K. F. (2014). International technical standards for test quality and test reviewing. In D. Bartram (Chair), 
Symposium presented at the International Congress of Applied Psychology, Paris, France, July, 2014.

Geisinger, K. F. (2014). Assessing 21st Century Skills. Invited workshop presented at the biannual meeting of the
International Test Commission, San Sebastian, Spain, July, 2014. 

Geisinger, K. F. (2014). Preparing doctoral-level psychometrics specialists. In T. Oakland (Chair), How do we
prepare psychometric specialists. Symposium presented at the biannual meeting of the International Test 
Commission, San Sebastian, Spain, July, 2014. 

Geisinger, K. F. (2014). Applied Measurement in Education. Roundtable with a journal editor presented at the 
annual meeting of the American Educational Research Association, Philadelphia, PA, April, 2014. 

Lee, H. & Geisinger, K. F. (2014). Differential item functioning analysis models in large-scale assessment. Paper
presented at the annual meeting of the American Educational Research Association, Philadelphia,
PA, April, 2014. 

Lee, H. & Geisinger, K. F. (2014). Purification of the matching criterion in the equated pooled booklet method for
DIF. Paper presented at the annual meeting of the National Council on Measurement in Education,
Philadelphia, PA, April, 2014.

Geisinger, K. F. (2013). Outcomes assessment. Lecture presented at King Fahd University of Petroleum and 
Minerals, Dhahran, Saudi Arabia, November. 

Geisinger, K. F. (2013). Classroom assessment. Workshop presented at King Fahd University of Petroleum and 
Minerals, Dhahran, Saudi Arabia, November. 

Geisinger, K. F. (2013). Best practices for faculty in graduate admissions. Workshop presented at King Fahd 
University of Petroleum and Minerals, Dhahran, Saudi Arabia, November. 

Geisinger, K. F. (2013). Grading: Assessment technique and learning facilitator. Workshop presented at King 
Fahd University of Petroleum and Minerals, Dhahran, Saudi Arabia, November. 

Geisinger, K. F. (2013). Setting the minimum passing score for the CFA examination. Workshop presented to the 
Chartered Financial Analyst Board of Governors, London, England, November. 

Geisinger, K. F. (2013). Tensions between Educational/Political Realities and Reliability and Validity. In F.
Worrell (Chair), Effective use of data for program improvement. Symposium presented at the annual 
meeting of the American Psychological Association, Honolulu, August. 

Geisinger, K. F. (2013). Building unbiased assessments. Workshop presented to the faculties of the Bryan College 
of Health Sciences, Clarkson College, and Nebraska Methodist College. 

Lee, H. S. & Geisinger, K. F. (2013). Efficiency of Generalized Full Information Bifactor Model. Poster presented
at the annual meeting of the National Council on Measurement in Education, San Francisco, April. 

Geisinger, K. F. (2013). Contributions of Anne Anastasi. In S. Sinharay (Chair), A look at our psychometric
history: Contributions of Thurstone, Lindquist, Anastasi, Bock, Messick, and Holland. Symposium 
presented at the annual meeting of the National Council on Measurement in Education, San Francisco, 
April. 

Geisinger, K. F. (2013). The future of admissions testing in the United States. Invited keynote at the Buros “Big 
Issues in Testing Conference.” Lincoln, NE, March. 

Geisinger, K. F. (2013). A testing course focusing on diversity issues. In K. F. Geisinger (Chair), Making a
quantitative program more multicultural. Symposium presented at the National Multicultural Summit and 
Conference, Houston, January. 

Geisinger, K. F. The future content in admissions testing. Invited presentation to the First International Conference 
on Assessment & Evaluation: Admissions Criteria in Higher Education, Riyadh, Saudi Arabia, December, 
2012. 

Geisinger, K. F. (2012). Criterion-referenced testing. Paper presented at the Ronference Honoring Professor 
Ronald Hambleton, Amherst, MA, November. 

Geisinger, K. F., Carlson, J. F., & Jonson, J. Evaluating tests: Fundamental concepts and skills for psychologists 
and researchers. Continuing Education Workshop presented at the annual meeting of the American 
Psychological Association, August, 2012. 

Geisinger, K. F. & Bartram, D. International Perspectives on Test Reviewing. Paper presented at the Quadrennial 
meeting of the International Congress of Psychology, Cape Town, SA, July, 2012.

Geisinger, K. F. Evaluating tests: Fundamental concepts and skills for psychologists and researchers. Workshop
presented at the biannual meeting of the International Test Commission, Amsterdam, the Netherlands, July,
2012.

Geisinger, K. F. Languages and linguistic diversity. In P. Elosua (Chair), Linguistic diversity and testing.
Symposium presented at the biannual meeting of the International Test Commission, Amsterdam, the 
Netherlands, July, 2012. 

Geisinger, K. F. The testing of multiple languages in a single country. In D. Sandilands (Chair), Assessment of
linguistic minority students in Canada and the United States. Symposium presented at the biannual 
meeting of the International Test Commission, Amsterdam, the Netherlands, July, 2012. 

Geisinger, K. F. Some thoughts on international test adaptations. In J-L. Padilla (Chair), Challenges of test
adaptation in special contexts: The role of the ITC Guidelines. Symposium presented at the biannual 
meeting of the International Test Commission, Amsterdam, the Netherlands, July, 2012. 

Carlson, J. F. & Geisinger, K. F. (2012). Test reviewing at the Buros Center for Testing. In K. F. Geisinger (Chair),
International Perspectives on Test Reviewing. Symposium presented at the biannual meeting of the 
International Test Commission, Amsterdam, the Netherlands, July, 2012. 

Geisinger, K. F. (2012). A 50,000 foot view on observed score equating. In M. Wiberg (Chair), New 
developments in observed score equating. Symposium presented at the biannual meeting of the 
International Test Commission, Amsterdam, the Netherlands, July, 2012. 

McCormick, C. M., Shaw, L. H., Evers, A., & Geisinger, K. F. (2012). A multilevel approach to the EFPA/ITC
questionnaire on test attitudes. In A. Evers (Chair), Attitude of psychologists on tests and testing: The
results of an international survey. Symposium presented at the biannual meeting of the International Test
Commission, Amsterdam, the Netherlands, July, 2012.

Geisinger, K. F. Cultural bias in testing. Presentation at the First Session of the Summer Faculty and Staff 
Development Series, BryanLGH College of Health Sciences, May, 2012.

Geisinger, K. F. Anne Anastasi's Views on Ability and Achievement. Invited paper presented at the Hertz
Memorial Presentation in Memory of Anne Anastasi at the annual meeting of the Society for Personality 
Assessment, Chicago, IL, March, 2012. 

Geisinger, K. F. & Shaw, L. H. Evaluation of Accuplacer®, PSAT/NMSQT, and SAT program features.
Presentation to the Research Division of the College Board, New York City (also broadcast to Newtown, 
PA). March, 2012. 

Geisinger, K. F. & Patelis, T. Maintenance schedules for quality. Presentation to the Research Advisory Committee 
of the College Board, Phoenix, AZ, March, 2012. 

Geisinger, K. F. The future of admissions testing. Invited presentation at the ETS Conference on the Future of 
Learning, Education and Assessment, Educational Testing Service, Princeton, NJ, March, 2012. 

Geisinger, K. F. The scholarly and fair evaluation of psychological tests and assessments: English language and 
adapted tests. Invited workshop at the First Caribbean Regional Conference on Psychology, Nassau, 
Bahamas, November, 2011. 

Geisinger, K. F. Testing and psychometrics at NERA. In R. Michel (Chair). Designing statewide testing 
programs. Symposium presented at the annual meeting of the Northeastern Educational Research 
Association, Hartford, CT, October 2011. 

Geisinger, K. F. If we could change K-12 testing today. In T. Patelis (Chair). Past presidents discuss educational 
research. Symposium presented at the annual meeting of the Northeastern Educational Research 
Association, Hartford, CT, October 2011. 

Geisinger, K. F. Change and stability: Revisiting new recurrent concerns. In K. F. Geisinger, (Chair). Issues in 
large scale testing. Symposium presented at the annual meeting of the Northeastern Educational Research 
Association, Hartford, CT, October 2011. 

Geisinger, K. F. Diversity and psychometrics: A necessary but almost null hypothesis. In Diversity in
Psychometrics, P. Scott-Johnson (Chr.), Symposium presented at the annual meeting of the American 
Psychological Association, Washington, DC, August, 2011. 

Geisinger, K. F. Test reviewing at the Buros Center for Testing. In D. Bartram (Chr.), Internationalization of test 
reviewing. Symposium presented at the biannual meeting of the European Congress of Psychology, 
Istanbul, Turkey, July, 2011. 

Geisinger, K. F. Validation: Its role in Test Reviews at the Buros Center for Testing. In S. Sireci (Chr.), Validating 
educational and psychological tests: Theory, applications, and future directions. Symposium presented at
the biannual meeting of the European Congress of Psychology, Istanbul, Turkey, July, 2011. 

Byrne, B. M., Geisinger, K. F. & Oakland, T. The work of the International Test Commission. Symposium
presented at the Fifth Brazilian Congress of Assessment Psychology, Bento Goncalves, Brazil, June 2011. 

Geisinger, K. F. Scientific Publication in Psychological Assessment: Challenges toward the internationalization of 
the knowledge. In E. Remor (Chr.), Scientific Publication in Psychological Assessment: Challenges toward 
the internationalization of the knowledge. Symposium presented at the Fifth Brazilian Congress of 
Assessment Psychology, Bento Goncalves, Brazil, June 2011. 

Geisinger, K. F. The Scholarly Evaluation of Tests and Assessments. Invited keynote address presented at the Fifth
Brazilian Congress of Assessment Psychology, Bento Goncalves, Brazil, June 2011. 

Chin, T. Y., Geisinger, K. F. & Yang, Y. (2011). Classification Accuracy of Diagnostic Methods: A Simulation
Study. Paper presented at the annual meeting of the National Council on Measurement in Education. New 
Orleans, LA, April, 2011. 

Geisinger, K. F. (2011). The history of the Buros Center for Testing. In T. Patelis (Chr.), Perspectives on the
history of testing in the United States. Symposium presented at the annual meeting of the National Council 
on Measurement in Education. New Orleans, LA, April, 2011. 

Geisinger, K. F. (2011). Some thoughts on the breadth of educational and psychological testing. Invited lecture at 
the University of Kansas, Lawrence, KS. March, 2011. 

Geisinger, K. F. The Buros Center for Testing and its Admissions Testing Initiatives. Invited paper presented to the 
students of the SRAM program, UNL, Lincoln, NE, November 2010. 

Geisinger, K. F. Classical test theory. In Graduate Students Issues Committee Special Invited Session on Advanced 
Measurement and Statistics. Symposium presented at the annual meeting of the Northeastern Educational 
Research Association, Hartford, CT, October, 2010. 

Geisinger, K. F. Alternate assessment: Should assessment drive instruction? Paper presented at the annual meeting 
of the Northeastern Educational Research Association, Hartford, CT, October, 2010. 

Geisinger, K. F. Dissertations: Hurdles, pathways or gateways. In T. Patelis (Chr.), On Finishing and Further:
Dissertation Research Now and Then. Symposium presented at the annual meeting of the Northeastern 
Educational Research Association, Hartford, CT, October, 2010. 

Geisinger, K. F. Reviewing tests: A comprehensive approach. Presentation to the College Board Research and 
Development staff, Newtown, PA, October, 2010. 

Geisinger, K. F. Reviewing manuscripts for Applied Measurement in Education. Presentation to the College Board 
Research and Development staff, Newtown, PA, October, 2010. 

Geisinger, K. F. Consequences and validity. In E. Burke (Chr.) Reconsidering Messick: Validity and best practices 
in testing. Symposium presented at the biennial meeting of the International Test Commission, Hong 
Kong, July 2010. 

Geisinger, K. F. An American Psychometrician’s Perspective. In M. Ph. Born (Chr.), Informing about ISO 10667 -
An International Standard for Assessment Service Delivery in Work and Organizational Settings.
Symposium presented at the biennial meeting of the International Test Commission, Hong Kong, July 
2010. 

Geisinger, K. F. Evaluating Test Quality as Users and Writing Manuals as Authors: Two Sides of a Coin. Workshop 
presented at the biennial meeting of the International Test Commission, Hong Kong, July 2010. 

Geisinger, K. F. College Admissions Testing for Student Selection: Challenges for Deans and Vice President.
Paper presented to the Hochschulrektorenkonferenz (Conference of University Presidents). Bonn, Germany,
December, 2009. 

Geisinger, K. F. College admissions testing at German universities: What might such testing look like and what 
advantages and disadvantages might it bring? Paper presented to the RWTH (University of Aachen)
Psychology Department, December 2009. 

Geisinger, K. F. Concepts of validity. In Patelis, T., Conceptions of validity. Symposium presented at the annual 
meeting of the Northeastern Educational Research Association, Hartford, CT, October, 2009. 

Geisinger, K. F. How to get published. In K. Huff (Chair), Symposium for New Faculty Members. Symposium
presented at the annual meeting of the Northeastern Educational Research Association, Hartford, CT,
October, 2009.

Geisinger, K. F. Testing Issues and Concerns. Invited presentation to the Career and Technical Education State 
Collaborative Working Group of the Council of Chief State School Officers. Baltimore, MD, October 
2009. 

Geisinger, K. F. Paper-and-pencil vs. Computer-based Test Delivery. Invited presentation to the Career and
Technical Education State Collaborative Working Group of the Council of Chief State School Officers. 
Baltimore, MD, October 2009. 

Geisinger, K. F. A College Admissions Question: What would we do if the SAT and ACT did not exist? Paper 
presented at the annual meeting of the Northern Rocky Mountain Educational Research Association, 
Jackson Hole, October, 2009. (Also presented to the QQPM Seminar at the University of Nebraska-Lincoln,
November, 2009).

McCormick, C. M. & Geisinger, K. F. When do testing accommodations give an unfair advantage? A Comparison 
to a double-amputee sprinter's quest to compete in the Olympics. Paper presented at the annual meeting of 
the Northern Rocky Mountain Educational Research Association, Jackson Hole, WY, October, 2009. 

Foley, B. P., Geisinger, K. F., Roschewski, P., & Foy, E. (2009, October). Conducting an alignment study in the
context of a performance assessment with a single writing prompt. Paper presented at the Annual meeting 
of the Northern Rocky Mountain Educational Research Association, Jackson Hole, WY. 

Geisinger, K. F. Non-Traditional Admissions Measures in Higher Education: Some Comments. In P. Kyllonen 
(Chr.), New constructs and new measures in higher education admissions. Symposium presented at the 
annual meeting of the American Psychological Association, Toronto, CA, August, 2009. 

Geisinger, K. F. Research on the SAT-Writing Test. Discussant Comments in W. Camara (Chair), The SAT Writing 
Test: An Update on Research. Toronto, CA, August, 2009. 

The Buros Institute of Mental Measurements Test Review Process. (With J. F. Carlson.) In D. Bartram (Chr.),
Symposium on national approaches to test quality assurance. Symposium presented at the 11th Biannual
European Congress of Psychology, Oslo, NO, July, 2009.

Status update on the revision of the US Joint Standards on Testing. In E. Burke (Chr.), International guidelines and 
standards related to tests and testing. Symposium presented at the 11th Biannual European Congress of
Psychology, Oslo, NO, July, 2009.

An educational testing perspective on the ITC testing quality control guidelines in scoring, analysis and reports. In 
A. Allalouf & M. Born (Co-Chrs.), The development of ITC guidelines on quality control in scoring,
analysis and reports. Symposium presented at the 11th Biannual European Congress of Psychology, Oslo,
NO, July, 2009.

Eosinophilic Esophagitis: Interobserver Variability in a Disease Entity in Which Counting Counts. (With J. F.
Busier, N. Patel, I.D. Hill, & K.R. Geisinger). Poster presented at the annual meeting of the United States 
and Canadian Academy of Pathology, Boston, March 2009. 

American Psychological Association Science Agenda Goals. (With M. L. Cooper). Workshop presented to the 
Coalition of Academic, Scientific, and Applied Research Psychologists, American Psychological 
Association Building, Washington, DC, Feb. 19, 2009. 

The Buros Center for Testing at the University of Nebraska. Invited address at the University of Aachen, Aachen,
Germany, December, 2008. 

Adjusting standards to enhance validity: Post standard-setting panel considerations to enhance validity. (With C.
McCormick.) In K. Huff (Chr.), Validating Standards on Educational Tests, Symposium presented at the 
annual meeting of the Northeastern Educational Research Association, Rocky Hill, CT, October 2008. 

A focus and follow-up on fairness. In T. Patelis (Chr.), The Fordham Five’s Fundamentals of Fairness. Paper 
presented at the annual meeting of the Northeastern Educational Research Association, Rocky Hill, CT, 
October 2008. 

Testing issues and concerns: An introductory presentation to the State Directors of Career and Technical Education. 
Invited Keynote address to the biannual meeting of the State Directors of Career and Technical Education, 
Mystic, CT, October, 2008. 

Three significant roles in the teaching of measurement: Mentor, administrator, and campus consultant. Jacob
Cohen Award Speech invited at the annual meeting of the American Psychological Association, Boston, 
MA, August, 2008. 

Issues in cross-cultural testing: The future was yesterday. In B. Byrne (Chr.), Interplay of cross-cultural
comparisons and related methodological practices. Symposium presented at the annual meeting of the 
American Psychological Association, Boston, MA, August, 2008. 

The rights and responsibilities of test takers and test makers. Invited keynote address to the biannual meeting of the 
International Testing Commission, Liverpool, Eng., July, 2008. 

Anne Anastasi’s views on ability and achievement: Implications for the training of measurement professionals. In 
T. Patelis (Chr.), The legacy of Anne Anastasi on educational research and assessment: Commemorating the
100th anniversary of her birth. Symposium presented at the annual meeting of the American Educational
Research Association, New York, NY, March, 2008.

Current validation practice for academic achievement tests. (With C. McCormick & A. Romhild.) Paper presented 
at the annual meeting of the National Council on Measurement in Education, New York, NY, March, 2008.

Timeliness in meeting the testing standards. In T. Patelis (Chr.), Maintaining quality in large-scale assessment
(a.k.a. Maintenance Schedules: They’re not just for your car.) Symposium presented at the annual meeting
of the Association of Test Publishers, Dallas, TX, March, 2008. 

An international standard (ISO) for assessment in work and organizational settings. (With D. Bartram & W.
Camara.) Symposium presented at the annual meeting of the Association of Test Publishers, Dallas, TX, 
March, 2008. 

The historical and present role of the Buros Institute of Mental Measurements. In K. F. Geisinger (Chr.) Test 
evaluation in the 21st Century. Symposium presented at the annual meeting of the Association of Test
Publishers, Dallas, TX, March, 2008. 

From the Bronx through New Brunswick to Lincoln, Nebraska: Critical Questions in the Review of Tests. (With J. 
F. Carlson). Paper presented at the annual meeting of the Northeastern Educational Research Association, 
Hartford, CT, October, 2007. 

Implications of the Spellings Commission for Outcomes Assessment in Higher Education. Paper presented at the
annual meeting of the Northern Rocky Mountain Educational Research Association Meeting, Jackson Hole, 
WY, October 2007. 

Assessment after the Spellings Commission. Paper presented as an after-dinner presentation to the Academic
Leadership Dinner held at the annual meeting of the American Psychological Association, San Francisco, 
CA, August, 2007. 

Improving test use. In L. Stricker (Chair), Improving Test Use. Discussant comments presented at a symposium
presented at the annual meeting of the American Psychological Association, San Francisco, CA, August, 
2007. 

Changes in the Verbal Test of the GRE. In K. F. Geisinger (Chair), The Revised Graduate Record Examination
General Test—Requisite Knowledge. Paper presentation in a symposium presented at the annual meeting 
of the American Psychological Association, San Francisco, CA, August, 2007. 

The future of high stakes testing. Keynote address at the Barbara Plake Festschrift Celebration, Lincoln, NE, May, 
2007. 

Investigating students with disabilities on the SAT. In D. L. Morgan (Chair), Investigating students with disabilities 
on the SAT. Symposium presented at the annual meeting of the American Educational Research 
Association, Chicago, IL, April, 2007. 

Non-cognitive predictors and academic success. In A. E. Schmidt (Chair), The use of non-cognitive measures for
guidance and selection. Discussant comments presented at an invited symposium presented at the annual 
meeting of the American Psychological Association, New Orleans, LA, August, 2006. 

Changes in Large-Scale Admissions Measures in American Higher Education: Implications for Test Adaptation. 
(With D. G. Payne). Invited paper presented at the fourth biannual conference of the International Test 
Commission, Brussels, BE, July, 2006. 

The New GRE Test. In D. G. Payne (Chair), The New GRE General Test and GRE 2005 Volume Report.
Symposium presented at the annual meeting of the Council of Graduate Schools, Palm Springs, CA, 
December, 2005. 

The New GRE Test. (With D. Piacentino.) The New GRE General Test. Paper presented at the annual meeting of
the Association of Texas Graduate Schools, Lubbock, TX, October, 2005. 

The New GRE Test. In D. G. Payne (Chair), The New GRE General Test and GRE 2004 Volume Report.
Symposium presented at the annual meeting of the Council of Graduate Schools, Washington, DC, 
December, 2004. 

Development of a Statement of Test Taker Rights and Responsibilities. In N. Abeles (Chair), Ethical Issues in 
Assessment. Invited symposium presented at the annual meeting of the American Psychological 
Association, Honolulu, HI, August, 2004. 

Revisions to the GRE General Test. In D. Johnson (Chair), Use of the GRE and the Analytic Writing Measure in 
Master’s and Ph.D. Programs: Views from the Field. Symposium at the annual meeting of the Council of 
Graduate Schools, San Francisco, December, 2003. 

An Update on the Graduate Record Examination. Presentation at the annual meeting of the Association of Texas 
Graduate Schools, San Angelo, TX, September, 2003. 

An Administrative Perspective on Part-Time Faculty Members: The Issue of Best Utilizing Adjuncts. In J. F.
Carlson (Chair), Don’t quit your day job: Perspectives on Part-Time Teaching. Symposium presented at 
the annual meeting of the American Psychological Association, Toronto, ON, August 2003. 

Psychometric Issues in Testing Individuals with Disabilities: Instructional Validity. Invited Keynote Symposium 
entitled “High Stakes Testing: Challenges, Victories and Best Practices” at the annual meeting of the 
International Dyslexia Association, Atlanta, GA, November, 2002. 

Some thoughts on Dr. Thomas F. Donlon, My Friend and Mentor. Acceptance remarks upon receipt of the Thomas 
F. Donlon Award, presented at the annual meeting of the Northeastern Educational Research Association, 
Kerhonkson, NY, October, 2002.

Anne Anastasi’s continuum of experiential specificity for tests of developed ability and the current SAT
controversy. Paper presented in the Tribute to Anne Anastasi Symposium at the American Psychological 
Association, Susana Urbina (Chair), Chicago, IL, August, 2002. 

Some language issues in educational and psychological testing. Paper presented at the annual meeting of the 
American Psychological Association, San Francisco, August, 2001. 

Some Thoughts on the Matter of Flagging: Reactions to a Trial. Paper presentation to the annual meeting of the 
National Council on Measurement in Education, Seattle, WA, April, 2001. 

Some issues in the college use of Advanced Placement tests. (With D. DePerro.) Invited presentation to the annual 
meeting of the Middle States Regional Council of the College Board, Baltimore, MD, February, 2000. 

Testing individuals who do not fit the mold. Invited presentation to the Psychometrics program, University of 
Massachusetts, Amherst, MA, October, 1999. 

Considerations in adapting intelligence tests: A focus on the Wechsler Tests. Invited presentation at the Joint
European Conference of the International Association for Cross-Cultural Psychology and the International
Test Commission, Graz, Austria, June, 1999.

A review of some Spanish-language adaptations of some English-language intelligence tests. Keynote address 
presented at the International Conference on Test Adaptation: Adapting Tests for Use in Multiple 
Languages and Cultures, Washington, DC, May, 1999. 

Psychometric issues in achieving equity in psychological assessment. In J. Sandoval (Chair), Test interpretation and 
diversity: Achieving equity in psychological assessment. Symposium presented at the annual meeting of
the American Psychological Association, San Francisco, CA, August, 1998. 

Some Summative Thoughts on Sternberg’s Paper and the Validity of the Graduate Record Examination in Graduate 
Admissions. In A. R. Fitzpatrick (Chair), Evaluating the predictive validity of the Graduate Record 
Examination. Symposium presented at the annual meeting of the American Psychological Association, San
Francisco, CA, August, 1998. 

An interprofessional project on rights and responsibilities of test takers. In H.E. Roberts-Fox (Chair), Test-taker 
rights and responsibilities: Issues and perspectives. Symposium presented at the annual meeting of the
American Psychological Association, San Francisco, CA, August, 1998. 

Faculty use of the GRE in graduate admissions: What is the validity? Paper presented at the annual meeting of the 
Northeastern Association of Graduate Schools, Baltimore, MD, April, 1998. (Also presented to the 
Technical Advisory Committee for the Graduate Record Examination, Educational Testing Service, 
Princeton, NJ, June, 1998.)

The Library of the future: One academic administrator’s reflections. Keynote address presented at the annual 
meeting of the New York State Library Assistant’s Association, Syracuse, NY, June, 1998. 

A brief history of test taker rights and responsibilities: A call for codification. In J. Noble (Chair), The rights and
responsibilities of test takers. Symposium presented at the annual meeting of the National Council on
Measurement in Education, San Diego, CA, April, 1998. 

Psychometric issues involved in test interpretation for members of diverse groups. In H. Roberts-Fox (Chair), Test 
interpretation and diversity: Achieving equity in assessment. Invited symposium presented at the
Assessment ’98: Assessment for change—Changes in assessment conference, St. Petersburg, FL, January, 
1998. 
A multi-profession project to enumerate the rights and responsibilities of test takers. In K. F. Geisinger & W.
Schafer (Co-chairs), Test taker rights and responsibilities. Invited symposium presented at the Assessment
’98: Assessment for change—Changes in assessment conference, St. Petersburg, FL, January, 1998. 

Pathways to organizational diversity in the next millennium: Observations of a personnel testing specialist and a 
college administrator. Invited address to the International Training Conference on Public Personnel 
Administration: Human Resource Management—Stepping out of the Box. Doris T. McGuffey (Session 
Chair). Minneapolis, MN, September, 1997. 

Suggestions for improving test adaptation practice: Discussant comments. In H. Swaminathan (Chair), Large scale 
test adaptation projects: Designs, results, and suggestions for improving practice. Symposium presented at
the annual meeting of the National Council on Measurement in Education, Chicago, IL, March, 1997.

Testing accommodations for a new millennium: Computer administered testing for a changing society. Invited 
paper presented at the Invitational Conference on Testing and Higher Education, co-sponsored by 
Educational Testing Service and Xavier University, New Orleans, March, 1997. 

Development of a statement of test takers’ rights and responsibilities: Implications for Counselors. In R. Ekstrom
(Chair), The Work of the Joint Committee on Testing Practices. Invited Symposium at the annual meeting
of the American Counseling Association, Orlando, FL, March, 1997. 

Selected measurement contributions of Harold E. Mitzel. In M. E. Horan (Chair), A Tribute to Harold E. Mitzel: A
founder of NERA and a leader in educational research. Symposium presented at the annual meeting of the
Northeastern Educational Research Association, Ellenville, NY, October, 1996.

Development of a statement of test takers’ rights and responsibilities. Paper presented as Introductory Remarks to 
the Open Conference on Test Taker Rights and at the national headquarters of the American Speech 
Language Hearing Association, Rockville, MD, October, 1996. 

The civil service testing of Hispanics. Invited presentation to the Personnel Testing Council of Metropolitan 
Washington, Washington, DC, September, 1996. 

Advances in test adaptation. Discussant comments. In W. Camara (Chair), Adapting and translating educational 
and psychological tests: Issues, technical advances, and guidelines. Symposium presented at the annual
convention of the American Psychological Association, Toronto, CA, August 1996. 

Testing people who do not fit the mold. Presidential Scholarly and Creative Activity Award Address presented 
at the annual Quest conference, Oswego, NY, April, 1996. 

The rights of test takers. Paper presented at the California Test Bureau/McGraw-Hill, Monterey, CA, February, 
1996. Also presented at Educational Testing Service, Princeton, NJ, June, 1996. 

The rights of test takers: A brief history. In K. F. Geisinger (Chair), The rights of test takers. Symposium
presented at the annual meeting of the American Speech Hearing Language Association, Orlando, FL,
December, 1995.

The Joint Committee on Testing Standards. In S. Goldsmith (Chair), The ABC’s of School Testing: A
Videotape. Symposium presented at the annual meeting of the American Speech Hearing Language
Association, Orlando, FL, December, 1995.

The development of a statement of test taker rights. In W. D. Schafer (Chair), Test taker rights. Symposium
presented at the annual meeting of the National Council on Measurement in Education, San Francisco,
April, 1995.

Reactions from a member of the development committee. In C. B. Schmeiser (Chair), Making the ideal real:
Dissemination and use of the NCME Code of Ethics. Symposium presented at the annual meeting of the
National Council on Measurement in Education, San Francisco, April, 1995.

Psychometric and policy issues in the use of tests with individuals with disabilities. Paper presented at the Joint 
Conference on Disability Issues sponsored by the American Bar Association, the Association of 
American Law School, the Law School Admission Council, and the National Conference of Bar 
Examiners, St. Louis, MO, April, 1995. 

A consideration of graduate education. In V. Hall (Chair), Graduate education in psychology. Symposium
presented at the annual meeting of the Northeastern Educational Research Association, Ellenville, NY,
October, 1994.

Needed changes in the Revised Standards for Educational and Psychological Testing. In W. J. Camara (Chair), 
Revision of the Standards for Educational and Psychological Testing. Symposium presented at annual
conference of the American Psychological Association, Los Angeles, CA, August, 1994. 

A summary of four reviews of the NCME Code of Professional Responsibility in Educational Assessment. In C. B.
Schmeiser (Chair), Membership Forum on the Proposed NCME Code of Ethics. Symposium
presented at the annual meeting of the National Council on Measurement in Education, New Orleans,
LA, April, 1994.

Who exactly are the testing police? In W. C. Camara (Chair), Enforcing Professional Standards in Measurement (or 
Do We Need the Testing Police?). Symposium presented at the annual meeting of the National Council on
Measurement in Education, New Orleans, LA, April, 1994. 

The Work of the Joint Committee on Testing Standards: The ABC’s of School Testing. In D. K. Smith (Chair),
The ABC’s of School Testing: A video for parents. Invited symposium presented at the annual meeting
of the National Association of School Psychologists, Seattle, WA, March, 1994.

The NCME Code of Professional Responsibility in Educational Assessment: Its development and orientation. In
K. F. Geisinger (Chair), Reactions to the NCME Code of Ethical Assessment Practices in Education.
Symposium presented at the annual conference of the Northeastern Educational Research Association,
Ellenville, NY, October, 1993.

The study of psychological testing of Hispanics: A beginning with a focus on industrial applications. Address 
presented to the SUNY Oswego chapter of Sigma Xi, Oswego, NY, September, 1993. 

Two SUNY-Oswego teacher education partnerships. In J. E. Milley (Chair), “Renewing Partnerships,”
Symposium presented at the Teach America II: Implementing Teacher Education Reform
Conference, Washington, DC, June, 1993.

Functions and uses of the Code of Ethical Assessment Practices in Education. In C. B. Schmeiser (Chair),
Ethics in Educational Assessment. Symposium presented at the Council of Chief State School
Officers 1993 National Conference on Large Scale Assessment, Assessment: Key to Systematic
Change, Albuquerque, NM, June, 1993.

Standards in standardization: United we stand. Keynote address presented at the annual spring seminar of the 
Counseling and Psychological Department, Oswego, NY, April, 1993. 

Ethics in the professions: The case of educational assessment. Invited keynote address at the Phi Kappa Phi 
Initiation Ceremony, Fordham University, New York, NY, April, 1993. 

Audiences, functions and uses of the Code of Ethical Assessment Practices in Education. In C. B. Schmeiser 
(Chair), NCME Code of Ethics: Reactions to a draft. Symposium presented at the annual meeting of
the National Council on Education, Atlanta, GA, April, 1993. 
Using subject matter experts to assess content representation: An MDS analysis. (With S. G. Sireci.) Paper
presented at the annual meeting of the National Council on Measurement in Education, Atlanta, GA, 
April, 1993. 

Perspectives on research on teacher education. Paper presented at the annual convention at the Northeastern 
Educational Research Association, Ellenville, NY, October, 1992. 

Initial validation of placement examinations at a community college. (With D. G. Seguin & K. S. Sweeney.) 
Paper presented at the annual convention of the Northeastern Educational Research 
Association, Ellenville, NY, October, 1991. 

The psychological testing of Hispanics in industry. Paper presented at the monthly meeting of the 
Connecticut Applied Psychological Association, New Haven, CT, September, 1991. 

Testing LEP students for minimum competency and graduation. Commissioned paper for the National
Research Symposium on Limited English Proficient (LEP) Students’ Issues: Focus on Evaluation
and Measurement, Washington, DC, September, 1991.

Disclosing interpreted test scores to test takers: What are the problems? In J. C. Hansen (Chair), Understanding 
test results: What should users and examinees know? Symposium presented at the American 
Psychological Association, San Francisco, CA, August, 1991. 

The graduate admissions process in psychology. Psi Chi Invited Lecture presented at the annual meeting 
of the Eastern Psychological Association, New York, NY, April, 1991. 

The metamorphosis in test validation. Invited address presented at the annual convention of the
Northeastern Educational Research Association, Ellenville, NY, November, 1990. 

Selecting and evaluating a site for the annual convention. Paper presented at the annual convention of 
the American Educational Research Association, San Francisco, CA, March, 1989. 

Using standard setting data to establish operational cutoff scores. In B. H. Loyd (Chair), Practical issues in 
conducting a standard setting study. Symposium presented at the annual convention of the
National Council on Measurement in Education, San Francisco, CA, March, 1989. 

Legal issues in test construction, validation and use. Presidential address presented at the annual
convention of the Northeastern Educational Research Association, Ellenville, NY, November, 1988.

Post-hoc strategies for insuring and improving content validity. Paper presented as part of a symposium
entitled, Issues related to content validation for selection of municipal employees, at the annual
convention of the International Personnel Management Association Assessment Council,
Philadelphia, PA, July, 1987.

Whither educational research? Roundtable presented at the annual meeting of the Northeastern 
Educational Research Association, Kerhonkson, NY, October, 1986.

Grading non-cognitive student behavior: A construct validation. (With V. W. Hevern, S.J.) Paper
presented at the annual meeting of the American Psychological Association, Washington, DC,
August, 1986.

The impact of the 1985 Joint Testing Standards on civil service testing. Invited address as part of the 
Visiting Scholar Lecture Series, Department of Personnel, New York, NY, March, 1986. 

The relationship and stability of two item-bias detection indices. (With G. Locke.) Paper presented at the 
annual meeting of the American Educational Research Association, Chicago, IL, April, 1985. 
The microcomputer as a research tool: Statistical packages. Pre-session presented at the annual
convocation of the Northeastern Educational Research Association, Ellenville, NY,
October, 1984.

A questionnaire approach to college curriculum evaluation. Paper presented at the annual convocation 
of the Northeastern Educational Research Association, Ellenville, NY, October, 1984. 

Ethnic group differences in the personal biserial index. (With F. J. Breyer). Paper presented at the annual 
meeting of the American Educational Research Association, New Orleans, LA, April, 1984. 

Public personnel selection testing and the law. Invited address as part of the Visiting Scholar Lecture 
Series, New York City Department of Personnel, New York, NY, April, 1984. 

An initial classification of non-cognitive student behavior grading items. (With V. W. Hevern, S.J.) Paper
presented at the annual meeting of the American Psychological Association, Anaheim, CA,
August, 1983. ERIC Document No. PS 014-211.

The relationships of attitudes toward multiple-choice tests and convergent production, divergent
production, and risk-taking. (With D. T. Horber.) Paper presented at the annual meeting of the
American Educational Research Association, Montreal, Canada, April, 1983. ERIC Document
No. ED 229-435.

Sex: A moderator variable between sex role and statistics performance. Paper presented at the annual 
meeting of the Eastern Psychological Association, Philadelphia, PA, April, 1983. 

Can scientific thinking be measured? (With P. Biesmeyer & H. Koritz.) Paper presented at the annual
convention of the National Science Teachers Association and the Society of College Science 
Teachers, Dallas, TX, April, 1983. 

Construct validation of faculty orientations toward grading: An experimental investigation of differential 
grade assignment. (With G. Locke.) Paper presented at the annual convocation of the 
Northeastern Educational Research Association, Ellenville, NY, October, 1982. 

A validation of the Veterinary Aptitude Test. Paper presented at the annual convocation of the 
Northeastern Educational Research Association, Ellenville, NY, October, 1981. 

Development of a scale to measure attitudes toward multiple-choice testing. (With D. Horber.) Paper
presented at the annual convocation of the Northeastern Educational Research Association,
Ellenville, NY, October, 1981.

Cross-validation of the factor structure of the McGill Pain Questionnaire. (With L. A. Bradley, M. Byrne,. Troy, L. 
Hopson Van der Heide & E.J. Prieto.) Paper presented at the annual meeting of the Eastern Psychological 
Association, New York, NY, April, 1981. 

Grade inflation and the potential for discrimination in graduate admissions. (With D. Grudzina and M. A.
Glynn.) Paper presented at the annual meeting of the National Council on Measurement in
Education, Los Angeles, CA, April, 1981.

The differential prediction of graduate school success for experimental and clinical psychology students.
(With J. Powell-Kiman.) Paper presented at the annual meeting of the Eastern Educational
Research Association, Philadelphia, PA, March, 1981.

A factor analysis of teachers’ attitudes about standardized testing. Paper presented at the annual convocation 
of the Northeastern Educational Research Association, Ellenville, NY, October, 1981. 

The incremental validity of an MMPI underachievement scale in predicting academic performance. (With
T. J. Dignelli.) Paper presented at the annual meeting of the Eastern Psychological Association,
Hartford, CT, April, 1980.

The language of low back pain: Factor structure. (With L. Hopson, E. J. Prieto, L. A. Bradley, & M. Byrne.) Paper
presented at the annual meeting of the Eastern Psychological Association, Hartford, CT, April,
1980.

An MMPI underachievement scale as a predictor of academic achievement among high school students.
(With V. Hevern, S.J.) Paper presented at the annual meeting of the American Educational
Research Association, Boston, MA, April, 1980.

Faculty techniques for preventing cheating: Some baseline data. (With J. J. Maiorca & J. J. Naumann.)
Paper presented at the annual meeting of the National Council on Measurement in Education,
Boston, MA, April, 1980.

Intra-university variations in grading: A rationale for differing standards. Paper presented at the annual 
meeting of the American Educational Research Association, Boston, MA, April, 1980. 

Grading and the psychology of motivation. Invited address to the National Conference on Higher 
Education, Washington, DC, March, 1979. 

Faculty orientations toward grading at three academic institutions. (With A. N. Wilson & J. J. Naumann.) Paper
presented at the annual convocation of the Northeastern Educational Research Association,
Ellenville, NY, October, 1979.

Academic policy and faculty-related changes influencing grading standards. In K. F. Geisinger (Chair),
University grade inflation: Documentation, causes, and consequences. Symposium presented at the
annual meeting of the American Psychological Association, New York, NY, September, 1979.

Individual differences among college faculty in awarding grades. Paper presented at the annual meeting 
of the National Council on Measurement in Education, San Francisco, CA, April, 1979. 

Grading policies and grade inflation. Paper presented at the annual convocation of the Northeastern 
Educational Research Association, Ellenville, NY, October, 1978.

Individual differences in calculator attitudes and performance in a statistics course. (With D. M. Roberts.)
Paper presented at the annual meeting of the American Educational Research
Association, Toronto, Ontario, Canada, April, 1978.

A systems approach to item production and review in a computer-managed instruction project. In H. E.
Mitzel (Chair), Mobile education for nurses: Computer-based instruction in support of an extended degree
program for registered nurses. Symposium presented at the annual meeting
of the American Educational Research Association, San Francisco, CA, April, 1976. ERIC
Document No. ED 121-280.

Prayer, biographical background and college experience. Paper presented at the annual meeting of the 
Southeastern Psychological Association, Hollywood, FL, May, 1974. 

Models for the teaching of graduate-level statistics courses in psychology departments. In E. J.
Robinson, Discussion of the role of the statistics course in psychology. Symposium presented at the
annual meeting of the Southeastern Psychological Association, Hollywood, FL, May, 1974.
ADMINISTRATIVE DEVELOPMENT

University of Indiana/Purdue University, The Fund Raising School, Center on Philanthropy. Principles and 
Techniques of Fund Raising. Houston, TX (March, 2005). 

Lilly Foundation, Building and supporting diversity at church-related colleges and universities, Seguin, TX, 
(March, 2004). 

Council of Independent Colleges and American Association of Academic Libraries, Reforming the academic 
library, San Francisco, CA, (February, 2004). 

Association for Institutional Research & Council of Independent Colleges, Data and Decisions, Denver, CO, 
(September, 2003). 

Council on Independent Colleges, Academic Vice President program entitled Leading from Within, led by Dr. 
Parker Palmer, Kalamazoo, MI (June, 1998). 

Harvard University, Institute for Educational Management (July, 1995) 

University of Massachusetts, Boston (New England Research Center for Higher Education), Defining the 
Collective Task (March, 1994) 

Council for Advancement and Support of Education (CASE), Major Gift Fund Raising for Deans (May, 1993) 

American Council on Education, Center for Leadership Development, Workshop for Department and Division 
Chairpersons and Deans (January, 1987) 

Council of Graduate Departments of Psychology, Workshops for Department Chairs (February, 1986, 1987) 


ADMINISTRATIVE ACCOMPLISHMENTS 


Buros Center for Testing 

• Brought out the 17th and 18th Mental Measurements Yearbook on time.

• Organized and ran the first strategic planning in the center’s history to generate a strategic plan (2010). 

• Developed a new, international vision for the Center. 

• Published Pruebas Publicadas en Español: An Index of Spanish Tests in Print, the first such document
in existence. 

• Developed plans for a new institute related to assessment literacy and effected them. 

• Developed plans to publish the first ever publication enumerating tests in Spanish. 

• Brought in new clients for assessment outreach and consultation. 

• Set a new course for the types of psychometric consultation that are appropriate given the changes in
statewide testing under the Common Core. 

• Developed new policies for dealing with test publishers. 

• Performed outreach efforts to work with test publishers in a more effective manner. 

• Hired new directors for institutes within the Center. 

• Developed strategies to identify new test reviewers. 

• Reorganized the meeting schedule of the National Advisory Council to provide additional external 
input. 

• Converted secretarial position to student workers. 

• Brought more than two million dollars of income over expenses over the initial four-year period. 

• Began a process of succession planning. 
• Updated the database that keeps the Test Reviews and Information information.

• Regularized the meeting schedule of our National Advisory Council. 

• Helped graduate assistants receive nationally prestigious summer fellowships. 

• Initiated, organized and ran the first celebration of the Center’s history. 

• Re-oriented the Buros Institute for Assessment Consultation and Outreach to focus more on equating 
and validation. 

• Filed suit against Taylor and Francis and received ownership of the journal Applied Measurement in
Education, and more than tripled our income from this work.

• Developed a leadership team in the Center. 

The University of St. Thomas 

• Drafted, proposed and had approved a 5-year plan for the Doherty Library. 

• Served as a primary participant in the SACS reaccreditation process that was completely successful. 

• Partnered with the Museum of Fine Arts to provide all of our students free membership.

• Negotiated approval of the controversial minor academic program, Woman, Culture and Society. 

• Extended partnerships with other Houston museums to extend membership to students. 

• Developed and instituted plan to infuse clerical support for academic units. 

• Led effort with deans to develop policy on the more effective use of adjuncts. 

• Coordinated and organized Chairs Workshop, Fall, 2003. 

• Developed and instituted a plan to reduce Arts and Science faculty teaching loads. 

• Served as member of the Governing Council, Partnership for Quality Education. 

• Hired deans for Schools of Business and Theology. 

• Renegotiated contract with the Diocese of Galveston/Houston for the campus of the School of 
Theology. 

• Proposed new plan for faculty evaluation working collaboratively with the Faculty Senate. 

• Radically increased grantsmanship in Academic Affairs. 

• Advocated for faculty awards in the areas of teaching, scholarship and service. 

• Established a process whereby goals for the Core Curriculum could be identified and agreed upon, to 
lead to the evaluation of the core curriculum. 

• Chaired an ad-hoc committee that developed a new plan for hosting Political Speakers on campus. 

• Provided Faculty Study Day (Convocation) address (Fall, 2001). Coordinated Faculty Study Day each 
semester. 

• Held monthly open-houses for faculty members. 

• Conducted focused deans’ retreats each semester. 

• Developed and had approved a faculty exchange program with St. Thomas University, New Brunswick,
CA. 

• Chaired task force on Academic Integrity. Produced recommendations for change at the University. 




• Reworked the budget of the Reagan Summer Academy so that we could continue to provide collegiate 
instruction to underserved students from a predominantly Hispanic, urban high school. 

Le Moyne College 

• Initiated the O’Brien Faculty Service Award to accompany the Teacher of the Year and Scholar of the 
Year Awards. 

• Upgraded Faculty Convocations as a true “coming together” through use of nationally recognized 
speakers and coordinated workshops. 

• Worked with the Budget Committee and the Vice President for Finance and Treasurer to set aside 
funds for academic equipment (my initiative). The budget became the first academic equipment 
budget at Le Moyne. 

• Chaired the board of a multi-university consortium (the Syracuse Consortium for the Cultural 
Foundations of Medicine). 

• Developed and implemented a faculty-run assessment plan. 

• Developed plan for, moved through governance, and initiated the Honors House across the street from 
the Campus Center. 

• Developed and submitted a strategic plan for the Academic division. Began the development of an 
academic plan for the College. 

• Developed a model to help identify the need for faculty lines in academic departments. 

• Infused significant technology into the curriculum through a faculty development program led by a 
faculty member. 

• Proposed and negotiated Academic Librarian Status, and had approved by the librarians, the Faculty 
Senate and the Board of Trustees. 

• Engaged in a re-organization of the Academic Affairs Office that led to the new Assistant Vice 
President for Multicultural Affairs position, a return to a Director of Continuing Education position, 
and elimination of the Special Assistant position. Served on a committee to re-conceptualize the 
College into (1) Arts and Sciences and (2) Management and Graduate Studies. Hired an Associate 
Dean to facilitate student-centeredness. 

• Worked with other administrators to increase the student-faculty ratio from 12.5-1 to 15-1, as called 
for by the Board of Trustees. 

• Brought about and/or enhanced on-going discussions regarding new majors in Communications, 
Environmental Science, Global Business, Management Information Systems, Theatre, Nursing, and a 
masters in Accounting. 

• Advanced discussion on campus concerning the arts, internships, international education, increased 
diversity in faculty hiring, and faculty service through my convocation speeches. 

• Worked closely with a faculty committee and with input from the Trustees’ Academic Affairs
Committee on a plan of action that led to the saving of the Physics major program.

• Led discussion that led to plans for instructional use of an older cafeteria. 

• Hired a new and outstanding Director of the Madden Center (a business outreach center) from the local 
business community. 

• Initiated accreditation effort of our Education programs through TEAC. Led discussions culminating 
in the decision to pursue TEAC accreditation. Moved the Department of Education substantially 
ahead on several fronts through the hiring of a new, outside Department Chairperson. 

• Helped to fashion the Arab Studies program and to get it funded and running. 
• Worked collaboratively with others to design a new Performing Arts Center and a campus archives.

• Participated in discussions related to the development of approximately 4 smart classrooms/year. 

• Established annual budget line ($150,000) for academic (e.g., scientific and instructional) equipment, 
not including computers (which are funded from other accounts). 

• Served on a variety of American Red Cross and regional educational boards. 

SUNY-Oswego 

• Led conversion from a Division of Arts and Sciences to a College of Arts and Sciences through faculty 
governance and administrative structures. 

• Raised initial funding for a speaker series to commemorate the founding of the College of Arts and 
Sciences. This series was so successful that it was instituted permanently. 

• Increased faculty diversity significantly through hiring procedures. The percentages of ethnic minority 
and women tenure-track faculty hires during the 1992-93, 1993-94 and 1994-95 academic years were 
approximately 25% and 50%, respectively. Increased the number of women chairpersons from 0 to 3 
out of 19. 

• Initiated a Faculty Executive Committee for the College of Arts and Sciences to improve consultation. 

• Led efforts to develop new majors in Journalism, Graphic Arts, Criminal Justice (transformed from 
Public Justice), Human Services, Legal Studies, Human Development, and Language and International 
Trade. 

• Initiated and coordinated the actions leading to the chartering of Phi Kappa Phi (a national honor
society and the first national interdisciplinary honor society at Oswego) on campus.

• Raised non-state funds to set up a College of Arts and Sciences Faculty Development Travel Fund. 

• Initiated and conducted annual new chairperson training programs. 

• Organized a development program for all chairpersons from five campuses in the SUNY system. This 
program was evaluated by participants as being extremely effective. 

• Completed development of a major in Environmental Science. 

• Established a board of local health professionals to provide guidance to the health professions and to 
faculty in the sciences. 

• Worked with the Graduate School Dean at SUNY-Health Sciences Center to initiate a summer 
research and pre-graduate study program for advanced students in biology, chemistry, physics and 
psychology. 

• Co-chaired the Teacher Education Commission (1992-93). 

• Hired new directors for the Tyler Art Gallery and the Rice Creek Field Station. 

• Instituted more balance among teaching, service and scholarly activity through hiring policies, faculty 
development activities, and promotion and retention practices. 

• Worked as part of a team of deans to develop a more flexible faculty workload policy. 

• Initiated the effort to bring a NASA/JOVE grant to SUNY-Oswego and served as the administrative liaison
on the JOVE team. This grant is the first NASA/JOVE grant in SUNY.

• Served on the Interim Provost Search Committee (1993-94). 

• Developed a proposal, received funding for, and initiated a multi-media Language/Journalism
laboratory. 
• Increased the numbers of College of Arts and Sciences students studying abroad through efforts with
the Office of International Education. 

• Substantially updated technology within department office and science laboratories. 

• Served on numerous charitable and community boards. 
UNITED STATES DISTRICT COURT
FOR THE DISTRICT OF COLUMBIA 


AMERICAN EDUCATIONAL RESEARCH 
ASSOCIATION, INC., AMERICAN 
PSYCHOLOGICAL ASSOCIATION, INC., and 
NATIONAL COUNCIL ON MEASUREMENT IN 
EDUCATION, INC., 

Plaintiffs, 

v. 

PUBLIC.RESOURCE.ORG, 

Defendant. 


Case No. 1:14-CV-00857-TSC-DAR

DEFENDANT-COUNTERCLAIMANT 
PUBLIC.RESOURCE.ORG’S MOTION 
TO STRIKE ECF NO. 60-88, THE 
DECLARATION OF KURT P. 
GEISINGER IN SUPPORT OF 
PLAINTIFFS’ MOTION FOR 
SUMMARY JUDGMENT AND 
PERMANENT INJUNCTION 

Action Filed: May 23, 2014 


Defendant-Counterclaimant Public.Resource.Org, Inc. (“Public Resource”) respectfully 
moves to strike ECF No. 60-88, the Declaration of Kurt P. Geisinger In Support of Plaintiffs’
Motion for Summary Judgment and Permanent Injunction. 

As described in the attached Memorandum of Law in Support of Defendant’s Motion to 
Strike, Kurt P. Geisinger’s testimony includes new opinions, reasons, and facts that were not 
disclosed in his expert report and must be excluded under Federal Rule of Civil Procedure 37.
Further, Geisinger is not qualified to testify on the matters contained in the report under the 
standards of Federal Rule of Evidence 702 and Daubert. Mr. Geisinger’s opinions further rest 
uncritically on statements from Plaintiffs’ agents, invade the province of the court, and rest on 
unsupported assumptions, facts, and methods. For these reasons, Mr. Geisinger’s report should 
be stricken from the record, along with all citations to and quotations of that report in Plaintiffs’ 
Motion for Summary Judgment and Permanent Injunction. 

Public Resource requests an oral hearing on this motion. 
This motion is based on the enclosed Memorandum of Points & Authorities, the 
Declaration of Matthew Becker and the exhibits attached thereto, Public Resource’s proposed 
Order, the pleadings and papers on file herein, and any further material and argument presented 
to the Court at the time of the hearing. 


Dated: January 21, 2016 


Respectfully submitted, 


/s/ Andrew P. Bridges 

Andrew P. Bridges (admitted) 

abridges@fenwick.com 

Sebastian E. Kaplan (pro hac vice pending) 

skaplan@fenwick.com 

Matthew Becker (admitted) 

mbecker@fenwick.com 

FENWICK & WEST LLP 

555 California Street, 12th Floor 

San Francisco, CA 94104 

Telephone: (415) 875-2300 

Facsimile: (415) 281-1350 

Corynne McSherry (admitted pro hac vice) 
corynne@eff.org 

Mitchell L. Stoltz (D.C. Bar No. 978149) 
mitch@eff.org 

ELECTRONIC FRONTIER FOUNDATION 

815 Eddy Street 

San Francisco, CA 94109 

Telephone: (415) 436-9333 

Facsimile: (415) 436-9993 

David Halperin (D.C. Bar No. 426078) 
davidhalperindc@gmail.com 
1530 P Street NW 
Washington, DC 20005 
Telephone: (202) 905-3434 

Attorneys for Defendant-Counterclaimant 
Public.Resource.Org, Inc. 
IN THE UNITED STATES DISTRICT COURT 
FOR THE DISTRICT OF COLUMBIA 


AMERICAN EDUCATIONAL RESEARCH ) 
ASSOCIATION, INC., AMERICAN ) 

PSYCHOLOGICAL ASSOCIATION, INC., ) 
and NATIONAL COUNCIL ON ) 

MEASUREMENT IN EDUCATION, INC., ) 

) 

Plaintiffs/Counterclaim Defendants, ) 

) 

v. ) 

) 

PUBLIC.RESOURCE.ORG, INC., ) 

) 

Defendant/Counterclaimant. ) 

_ ) 


Civil Action No. 1:14-cv-00857-TSC-DAR 


EXPERT’S DECLARATION AND 
REPORT OF KURT F. GEISINGER, 
Ph. D. PURSUANT TO FED. R. CIV. P. 
26(a)(2)(B) 


I, KURT F. GEISINGER, Ph. D., declare: 


1. I am currently Director of the Buros Center on Testing and W. C. Meierhenry 
Distinguished University Professor at the University of Nebraska-Lincoln. 

2. The following constitutes my expert’s report in this action on behalf of Plaintiffs, 
the American Educational Research Association, Inc. (“AERA”), the American Psychological 
Association, Inc. (“APA”) and the National Council on Measurement in Education, Inc. 


(“NCME”) (collectively, “Plaintiffs”), complaining of certain activities engaged in by 
Defendant, Public.Resource.Org, Inc. (“Public Resource”). 

3. This Declaration and Report contains my opinions to date. The basis for my 
opinions, the materials I considered in reaching my opinions, and my qualifications for rendering 
such opinions are set forth in this Declaration and attached Exhibits. I reserve the right to 
supplement my Declaration to address any additional documents and testimony introduced in this 
action that come to my attention between now and the time of any deposition, hearing or trial. 
My Qualifications 

4. I received my doctoral degree in Educational Psychology in 1977 from the 
Pennsylvania State University, after previously receiving my master’s degree in Psychology at 
the University of Georgia and my bachelor’s degree from Davidson College (with honors). I 
also studied German, Psychology and other topics as an undergraduate at the Phillips Universität 
in Marburg, Germany and at Harvard University when I attended the Institute for Educational 
Management in 1995. 

5. Previously, I served as the Vice President of Academic Affairs and Professor of 
Psychology at the University of St. Thomas in Houston, Texas, where I was responsible for four 
academic schools, approximately 200 faculty members, and over 4,000 students. I also served as 
Academic Vice President and Professor of Psychology at Le Moyne College, Dean of the 
College of Arts and Sciences and Professor of Psychology at the State University of New York at 
Oswego, and Professor of Psychology at Fordham University in New York City, where I was 
department chair for the Department of Psychology and director of the Doctoral program in 
Psychometrics. 

6. Over the past forty years, I have researched, studied, and taught psychometrics. 
Psychometrics, defined in more detail later in this report, is the quantitative study of tests and 
measures in terms of the value, usefulness, and interpretation of the results of such measures. I 
also am a fellow, diplomate, and member of numerous professional societies involving 
educational and psychological testing, such as the APA (fellow), the American Association for 
Assessment Psychology (diplomate), the AERA (fellow), and the NCME, as well as other 
professional associations. I have represented the APA by serving on and chairing the Joint 
Committee on Testing Practices (which is separate from the joint committee of the AERA, the 
APA and the NCME responsible for the 1999 Standards for Educational and Psychological 
Testing) and have served on the APA’s Committee on Psychological Tests and Assessment. In 
2010, I was elected to serve two terms (2006-2008 and 2009-2011) as the representative on the 
Council of Representatives for the APA’s Division of Evaluation, Measurement and Statistics. 
My second term was cut short by one year when I was elected to serve as a member-at-large on 
the APA’s Board of Directors in 2010, a position I held for a three-year term (2011-2013). 

7. I have authored numerous publications about psychological and educational 
testing. I have worked at the Educational Testing Service (“ETS”), chaired its Technical 
Advisory Committee for the Graduate Record Examination (“GRE”), served on the Board of 
Directors for the GRE (a Board that I also chaired), and have been a member of the College 
Board (formerly known as the College Entrance Examination Board), for which I served on its 
SAT Committee (from 2000-2002). I recently concluded a four-year term (from 2011-2014) on 
the Advisory Research Committee for the College Board, serving the last two years as its chair. 
I currently serve on the Technical Advisory Committee for the Educational Records Bureau. 1 

8. In 2010, I was elected to the Council (i.e., Board of Directors) for the 
International Test Commission—the primary international testing body. In 2012, I also was 
elected as its Treasurer and to serve on its Executive Council. I am the only American on its 
Executive Council. 

9. I was asked to review and share my comments on chapters of the 1999 Standards 
for Educational and Psychological Testing, published jointly by the AERA, the APA, and the 


1 The Educational Records Bureau specializes in the development and use of tests and testing 
products for private and independent educational institutions at the p-12 levels. 
NCME (the “1999 Standards”). The Joint Standards 2 embody the professionally accepted 
practices for testing and measurement. One of the chapters I reviewed was based upon the 
testing of individuals with disabilities, an area in which I have engaged in research and have 
served as an expert witness in federal courts as well as state courts in New York, New Jersey, 
and California. The other chapter related to the rights and responsibilities of test takers. See 
Exh. A. I note that the Joint Standards were revised in 2014. 

10. In addition to my 130 plus journal articles and book chapters, I have written, 
edited, or co-edited approximately 15 books and monographs. The vast majority of these 
publications deal with testing and measurement issues. For example, I have edited two books on 
the psychological testing of Hispanics and another I co-edited related to fairness in testing. I also 
have co-edited several books of reviews of published tests and measures. I also was Editor-in- 
Chief for the three-volume Handbook of Testing and Assessment in Psychology (published by the 
APA in 2013). Additionally, I have been editor of the journal Applied Measurement in 
Education for the past 8 plus years. Taylor & Francis, in conjunction with the Buros Center for 
Testing, publishes this journal. 

11. I also co-chaired a sub-committee of the APA’s Joint Committee on Testing 
Practices and the overall committee itself that developed a document on the rights and 
responsibilities of test takers (from 1993-2001). This document has been endorsed by a number 
of professional associations related to proper test use, including the APA, the National 
Association of School Psychologists, the American Counseling Association, and the NCME. 
While chairing the Joint Committee on Testing Practices, the committee developed a book 
entitled Assessing Individuals with Disabilities, in which I wrote a chapter. I also served on a 

2 I use the term Joint Standards to refer to the Standards for Educational and Psychological 
Testing as a whole, not a specific version of the Standards, i.e., 1999 or 2014. 
task force charged to illuminate issues related to the testing of individuals with disabilities as 
well as ethnic minorities. The task force wrote and edited a book entitled Test Interpretation and 
Diversity: Achieving Equity in Assessment, which was published by the APA’s publication unit 
in 1997. I authored three chapters in that volume. 

12. I additionally served on an APA task force (from 2007-2010) that considered the 
assessment and intervention of individuals with disabilities. The results of our work, “Guidelines 
for the Assessment of and Intervention with Individuals with Disabilities,” were published in the 
American Psychologist, the premier publication of the APA (Geisinger et al., 2012) and endorsed 
as the policy of the APA by its governance. A reference for the American Psychologist article 
may be found on my curriculum vitae, which is attached as Exhibit A. 

13. In the past two years (2014-2015), I have served on two task forces related to the 
use of measures in clinical psychology. One of these has written a policy, recently accepted by 
the APA’s Board of Directors, that differentiates the use of tests and other measures for 
screening from their use for assessment, two closely related types of testing that differ in specificity and 
focus. Tests are usually standardized measures that are given to a number of people for a 
specific purpose. A bar examination would be an example of a test. Measures are other 
typically quantitative values used to evaluate a person and include tests. A bathroom scale 
results in a measure (weight), but would not normally be considered as a test. 

14. During 2013-2014, I served on a committee of the Institute of Medicine (a 
component of the National Academy of Sciences) that evaluated the use of psychological and 
clinical neuropsychological measures by the Social Security Administration in determining 
disability status. The final report, entitled Psychological Testing in the Service of Disability 
Determination, is in the process of being published, but is also available from the Institute of 
Medicine’s website. 

15. For approximately four years (from 2008-2012), I jointly represented three 
professional associations (the AERA, the APA, and the NCME) in developing the International 
Organization for Standardization’s (“ISO”) first standard on psychological testing. The result 
of the work of the committee that engaged in this activity was ISO Standard 10667. The 
standard is divided into two parts. The first part establishes requirements and guidance for a 
client working with a service provider to carry out the assessment of an individual, a group, or an 
organization for work-related purposes. ISO 10667-1:2011 enables the client to base its 
decisions on sound assessment results. ISO 10667-1:2011 also specifies the responsibilities of a 
service provider in terms of the assessment methods and procedures that can be carried out for 
various work-related purposes made by or affecting individuals, groups or organizations. The 
second part lays out the responsibilities of the service provider in terms of the same assessment 
project. 

16. I also developed or helped to develop a number of testing measures. Specifically, 
I served as the primary consultant on a number of civil service examinations given in New York 
City for police officer, sergeant, lieutenant, and captain, fire fighter, fire lieutenant, fire captain, 
sanitation supervisor, and a variety of other civil service occupations over a period of at least a 
decade ending in 1992. I sometimes defended these measures in court. I also represented the 
Public Service Alliance of Canada against the Public Service of Canada in two cases related to 
their national testing efforts and Disability Rights Advocates with regard to several testing 
disputes concerning individuals with disabilities. See Exh. A. 
17. In recent years, my primary efforts have been to assure testing fairness for those 
with disabilities, language minorities, and ethnic minorities. 

18. My curriculum vitae is attached to this Declaration and Report as part of Exhibit 
A. 

19. A list of all publications that I have authored in the past 10 years is included in 
my curriculum vitae. See Exh. A. 

20. To the best of my memory, during the past 4 years, I have not testified as an 
expert at trial or by deposition. Previously, I have been accepted as an expert on testing in state 
courts in New York, New Jersey, and California, and in federal courts in New York, New Jersey, 
and Canada. Within the past four years (2011-2015), I have been identified as an expert in cases 
that were settled prior to trial. I wrote a report on the use of testing to deny an individual with 
disability benefits for a state agency in Nebraska this past fall, but the matter was resolved prior 
to going to court or arbitration. In none of these cases have I given deposition testimony. 

21. To explain what psychometrics is, I provide below the first two paragraphs of my 
entry in the Corsini Encyclopedia of Psychology (2010, Wiley, 3rd edition) on the topic of 
“Psychometrics: Norms, Reliability, Validity, and Item Analysis”. 

The field of psychometrics generally considers the data from educational and 
psychological tests and assessments from a quantitative perspective. Such data normally 
emerges from test responses, although it may come from a wide variety of measurement 
instruments. Two divisions might be identified within psychometrics: theoretical and 
applied psychometrics. Psychometric theory (as portrayed by Embretson & Reise, 2000; 
Lord, 1980; McDonald, 1999; Nunnally, 1978) provides researchers and psychologists 
with mathematical models used in considering responses to individual test items, entire 
tests and sets of tests. Applied psychometrics is the implementation of these models and 
their analytic procedures to test data (e.g., Thorndike, 1982). 

22. For over five years (1989-1995), I taught courses for the Cornell University 
School of Industrial and Labor Relations on the topic of affirmative action and equal opportunity 
hiring. These courses related to the use of tests in a fair and valid way to make personnel 
decisions such as hiring and promotion while attempting to increase diversity and to be in 
conformance with federal laws and guidelines. 

23. I have had past and ongoing relationships, as a member or fellow, with each of the 
three Plaintiff associations. I currently chair the International Committee of one of AERA’s 
divisions (Division D - Measurement and Research Methodology). I have 
presented at AERA’s annual conferences regularly. 

24. I have served on and chaired NCME’s professional development committee 
(1990-1992), served as a program co-chair for its annual meeting (1993), ran for its board (and 
was defeated) (1993), represented it on the committee that developed its code of professional 
conduct (ethics), and was a representative and advisory board member of a doctoral program in 
psychometrics that was being developed at Morgan State University (2007-2012). I have 
published in several of NCME’s journals and have served on the editorial committee for the 
journal, Educational Measurement: Issues and Practice (1992-1995). 

25. I also was elected, and have served, as a member of APA’s Committee on 
Psychological Tests and Assessment (1998-2000); on its Committee on International Relations in 
Psychology (2010); on its Joint Committee on Testing Practices (1992-1996); on its Council of 
Representatives (two terms, from 2006-2010), representing the Division of Measurement, 
Evaluation and Statistics; and on its Board of Directors. I was appointed to serve on APA’s 
Good Governance Task Force that prepared a plan to reorganize its governance. I also was 
appointed to serve on perhaps a half dozen APA task forces over the years related to testing 
issues of one type or another (e.g., the testing of individuals with disabilities, the testing of 
individuals who are ethnic minorities, the use of testing in clinical psychology). I served for 
eight years (from 1992-2000) as a member of APA’s editorial board for its journal, 
Psychological Assessment, and recently served on the committee that selected a new editor for 
that journal. Further, I served all three of these organizations by representing them on an 
American National Standards Institute (“ANSI”)/International Organization for Standardization 
(“ISO”) committee that developed an international standard for industrial testing. 

Materials that I have Considered 

26. A list of the facts, data, and materials that I have considered in forming my 
opinions in this case is attached to this Declaration and Report as Exhibit B. 

My Opinions Relevant to this Case, and the Basis and Reasons for my Opinions 

27. The Joint Standards serve as the foundation for the testing profession. It is the 
most authoritative single source about the best practices in testing. Other associations (i.e., the 
International Test Commission and the American Counseling Association) have much shorter 
and less comprehensive guidelines or standards related to testing or certain aspects of testing, but 
none have achieved the prominence that the Joint Standards currently enjoy. Also, none have 
the pervasive influence across different aspects of the use of tests. That is, the Joint Standards 
are appropriate in a wide range of diverse clinical, counseling, educational, and industrial 
settings with a variety of populations. It is because of the widespread respect in which the Joint 
Standards are held that they are sometimes cited in court cases. Elaborations of this conclusive 
statement follow. 
28. I currently direct and have directed for the past nine years the Buros Center for 
Testing, formerly known as the Buros Institute of Mental Measurements. It was founded some 
80 years ago by Oscar Buros, then a faculty member at Rutgers University, to be essentially 
the Consumer Reports of the testing industry. The Buros Center publishes comprehensive, 
critical reviews of testing. These reviews are available through our published volumes entitled 
Mental Measurements Yearbooks. I have spoken with the other editors of our primary document, 
the Mental Measurements Yearbook, and we agree that the most commonly cited document in 
the reviews of tests is the Joint Standards. Those who review tests and testing practices refer to 
the Joint Standards constantly, and this is reflected in our Mental Measurements Yearbooks. We 
believe it to be the most frequently used comprehensive yardstick against which the quality of 
tests and measures and the quality of test use is evaluated. 

29. In the 1990s, I co-chaired APA’s Joint Committee on Testing Practices (which 
had no direct or formal relationship with the joint committee of the AERA, APA and NCME that 
develops and revises the Joint Standards). The role of this committee on testing practices was to 
develop documents and products that could improve testing practices. One document that the 
committee developed and subsequently revised was entitled the Code of Fair Testing Practices 
in Education, a very brief document written for parents and users of educational tests alike. The 
members of the Joint Committee on Testing Practices agreed that the principles espoused in the 
Code of Fair Testing Practices in Education had to be consistent with the Joint Standards (given 
their pre-eminent status). I also co-chaired a working group of that committee, which developed 
a document entitled the Rights and Responsibilities of Test Takers. We began our work on this 
document by reviewing everything that the Joint Standards had to say about this aspect of 
testing. 
30. In the multiple editions of the Joint Standards, various psychometric concepts 
have evolved or changed. Most testing experts believe that the single most important quality in 
testing is the validity of test scores—that they are used and interpreted in appropriate and useful 
ways. One can trace the history of our profession’s perceptions of validity and how it may be 
estimated and determined by studying its portrayals throughout the seven versions of the Joint 
Standards that have been published to date. 

31. Background of the Standards for Educational and Psychological Testing, and its 
importance to the testing professions: The history of the Joint Standards is not brief. This 
history reflects changes in psychometric technology, our understandings of testing, and 
psychological/educational characteristics, societal fluctuations, technological improvements, as 
well as general Zeitgeist differences. The first set of standards was published in 1954 by the 
APA and was entitled, Technical Recommendations for Psychological Tests and Diagnostic 
Techniques. Its impact on the testing field was monumental. Shortly after the publication of this 
volume, in 1955, AERA and NCMUE (the original name of NCME was the National Council on 
Measurements Used in Education) published a similar document devoted almost exclusively to 
educational measures entitled, Technical Recommendations for Achievement Tests. There had 
been collaboration across these organizations in the development of these two initial documents 
(Eignor, 2013). The three organizations subsequently decided that their work should continue 
collaboratively. The two preceding documents and their first joint effort all described what test 
publishers should include in their test-related documentation and, to generate such information, 
what research efforts should be made in test development and use. 

32. In 1963, the Joint Committee (across the three associations - AERA, APA and 
NCME) was formed, and the first Joint Standards were ultimately published in 1966 as the 
Standards for Educational and Psychological Tests and Manuals (the “1966 Standards”). The 
next effort began only five years after the publication of the 1966 Standards. The Joint 
Committee worked from 1971 through the publication of their revised Standards in 1974 (the 
“1974 Standards”). This 1974 set of test standards focused less on documentation and more on 
topics such as how tests should best be developed, used, scored, and results reported. To 
emphasize the reduced focus on documentation, the title of the 1974 Standards was changed to 
Standards for Educational and Psychological Tests. 

33. In the early 1980s, still another Joint Committee was empaneled and charged with 
the revision of the Joint Standards, a process that concluded with the publication of the 1985 
Standards for Educational and Psychological Testing (“the 1985 Standards”). This minor 
change in the title from “tests” to “testing” emphasizes the changed focus from the tests 
themselves to the process of, use of, and interpretation of tests and the results of testing. 

34. The 1999 Standards in question in this case held with the same title. The 
development process of the 1999 Standards began with open meetings where many individuals 
were able to speak to the Joint Committee to provide input into the ways that they believed the 
Joint Standards should change. I was one such participant at meetings in Alexandria, VA in 
October of 1994. Whereas the time needed to write the Test Standards had been approximately 3 
years prior to the 1999 Standards, it appears that the time was closer to 4 years before the 1999 
Standards became available. 

35. The revision process for the current Standards that were published in 2014 was 
longer, although the final publication was delayed due to the present dispute with Public 
Resource. The Joint Committee was formalized in or around 2007, the first meeting was held in 
January 2008, and the Committee completed its work in 2013. This revision process took five 
years. 

36. The members of each Joint Committee are all volunteers, but staff support is 
needed. Not counting the very real costs of contributed staff support, the budget for the Joint 
Committee’s meetings was approximately $400,000. Had the staff salaries been covered in this 
budget, and had members of the Joint Committee received even minimal compensation for the 
work they performed, the budget for revising and updating the Joint Standards would have been 
approximately $2,000,000. These costs will continue to increase. 

37. There are several reasons I expect such cost increases. Testing is becoming both 
increasingly complex and increasingly technological. For example, 20 years ago all testing was 
“in person” or “paper-and-pencil.” Now there is testing via computer, testing via tablet, and 
testing via phone, but the “in-person” and “paper-and-pencil” testing method still continues. 
Unproctored testing on the internet is presently the most common type of personnel selection 
testing in non-public settings in the United States. In northern Europe, the most common type of 
testing is now via the internet. Secondly, society is putting increasing emphasis upon 
measurement concerns. Teachers are being evaluated in many settings based upon how their 
students perform on tests. Every year more professions and positions require tests to justify 
access to positions (e.g., via licensure and certification testing). As more and more testing cases 
are litigated, the need for clarification of professional practice increases concomitantly. Finally, 
travel and hotel expenses only continue to rise. 

38. Prior to the 1985 Standards, all of the individual standards composing the 
Standards volume were considered separately as “essential, very desirable, or desirable”. These 
statements indicated relative degrees of importance. The 1985 Standards were identified as 
primary, secondary, or conditional, depending upon both the importance and the nature of their 
use. Beginning in 1999, no status was assigned to individual standards, a practice that was also 
continued for the 2014 Standards. I heard informally from members of the 1999 Joint 
Committee that the reason for the change was to de-emphasize the role that the Joint Standards 
might play in litigation. 

39. One can glean from the prior discussion that the Joint Standards are changed and 
updated approximately every 10-15 years. The revision process has been taking an ever-longer 
amount of time, probably due at least in part to the increased focus on educational testing in the 
accountability movement across education in the United States. Testing is now used in part to 
affect federal budgets allocated to states to provide education and, for example, for teacher and 
professional staff evaluation. It is likely that the time frame will continue to increase as the focus 
on educational testing shows no letup in the foreseeable future. In fact, it is clear that the Joint 
Standards were once revised every 10 years and for the past two revisions, it has taken about 15 
years. One reason for this temporal increase relates to the complexity of the changes. 

40. Having three important professional associations involved in the development and 
updating of the Joint Standards helps bring much credence to them. It also, however, makes the 
process more cumbersome due to the communications and decision-making necessitated by 
having three professional associations involved that may all see various professional issues 
differently. 

41. Since the revision of the 1999 Standards, the development of revised Joint 
Standards is controlled by a Management Committee. The three associations (AERA, APA, and 
NCME) established a Management Committee consisting of three representatives, one from each 
sponsoring organization. These members are appointed by the Chief Executive Officers of 
AERA and APA and by the President of NCME. Each member normally serves a 3-year, 
renewable term. Members are usually appointed so that terms are staggered, providing for 
continuity. Members may be reappointed for one additional three-year term, and in years when 
the Joint Standards are in active revision, members’ terms for the duration of the publication 
period may be extended by each sponsoring organization. 

42. Members of the Management Committee oversee all aspects of the Joint 
Standards on an ongoing basis, including, but not limited to: publication and distribution, 
oversight of the Development Fund, protecting the copyright of the Standards, archival activities, 
gathering information about their use, and potential revision issues. The Management 
Committee represents both the interests of the Joint Standards and the interests of all three 
sponsoring organizations in conducting its administrative duties. 

43. At least once every five years, members of the Management Committee confer 
with the leadership of their respective organizations to assess the need for revision of the Joint 
Standards. 

44. If it is determined that the Joint Standards need revision, the Management 
Committee first appoints co-chairs of a “Joint Committee” that addresses the specifics of the 
revision effort. In the case of the 2014 Standards, a website was developed whereby members of 
the sponsoring organizations could provide feedback concerning the changes that they believe 
necessary or important. Following this process, in collaboration with the co-chairs, the 
Management Committee appoints members of the Joint Committee who represent the sponsoring 
organizations. The Joint Committee provides a structure for communication with the sponsoring 
organizations throughout the revision process. 
45. One of the reasons that the process is so lengthy is that the Joint Committee is 
composed of nationally prominent experts cutting across clinical psychology, counseling 
psychology, school psychology, industrial/organizational psychology, clinical neuropsychology, 
and educational testing. The Joint Committee is composed of highly regarded professionals at 
the top of their respective fields, who have specialized knowledge regarding the use of tests and 
measures in their respective disciplines. Getting such extraordinarily busy individuals to agree to 
serve in a volunteer, unpaid fashion is difficult enough. Working with their schedules to set up 
meetings when all can attend is administratively an arduous process. My understanding is that 
the individuals take on this task as service to their profession, their associations, and the Joint 
Standards themselves. 

46. The Management Committee oversees the process throughout the development of 
new standards, and reports back to their respective associations. 

47. The entire Joint Standards revision process is financed through sales of a prior 
version of the Joint Standards. As noted above, the direct costs for the development process have 
been estimated at $400,000 for the 2014 Standards and between $500,000 and $600,000 for the 
1999 Standards. 

48. What Public Resource did with the 1999 Standards: I reviewed Public Resource’s 
discovery responses and the transcripts (with exhibits) from the depositions of Carl Malamud 
(President and Founder of Public Resource) and Chris Butler (Office Manager of the Internet 
Archive). In May 2012, I understand that Mr. Malamud purchased a used copy of the 1999 
Standards. I understand further that he then sliced the printed pages out from their bindings, 
scanned the pages to a PDF file, and posted the file to Public Resource’s website as well as to a 
publicly available collection on the Internet Archive. See Exh. C at 5-6. A graduate psychology 
student who engaged in such actions would probably be dismissed from his or her program. 
Most certainly he or she would be subject to ethics charges that could follow him or her 
throughout his or her career. 

49. The PDF file posted to Public Resource’s website and to the Internet Archive 
contained a cover page prepared by Mr. Malamud, leading Internet users who came upon it to 
believe that the 1999 Standards were freely available for download, copying, or whatever use 
someone wanted to make of the text. See Exh. D. To the extent that individuals accessed the 
1999 Standards in this fashion, it would appear to be theft of services. 

50. The PDF file with Mr. Malamud’s cover page was posted to Public Resource’s 
website and to the Internet Archive from July 2012 until June 2014. See Exh. C at 5. Based 
upon incomplete records provided by Public Resource, during the time it was posted to the 
Internet, the PDF file containing the entire text of the 1999 Standards that was posted to Public 
Resource’s website was accessed by Internet users at least 4,405 times. See Exh. C at 9-10. The 
same PDF file containing the entire text of the 1999 Standards that was posted to the Internet 
Archive website was accessed by Internet users at least 1,113 times. See Exh. E. No restrictions 
were placed on this PDF file to prevent Internet users from downloading the file to their local 
hard drives or printing it using a printer attached to an Internet user’s computer. See Exh. F at 
346-47. These accessions represent considerable lost revenue to the organizations supporting the 
Joint Committee. However, even if the incomplete records kept by Public Resource and Internet 
Archive were completely accurate, such numbers would not represent the real numbers as lost 
revenue because one person can download the Joint Standards and share them with hundreds of 
colleagues. 

51. In December 2013, AERA asked Mr. Malamud to remove the 1999 Standards 
from the Internet locations where he posted the document. He refused. Ultimately, Mr. 
Malamud did remove the 1999 Standards from public view on the Internet, but only after Public 
Resource was sued for copyright infringement and threatened with a motion for a preliminary 
injunction. See Exhs. F at 324-26, G. 

52. During his deposition, Mr. Malamud testified that, should Public Resource 
succeed in this litigation, it would be a very easy matter for him to re-post the 1999 Standards to 
his company’s website and to the Internet Archive website. See Exh. F at 307. Further, Mr. 
Malamud contemplated that Public Resource might do the same with the 2014 Standards. See 
Exh. F at 308-09. Such actions would almost certainly lead to the 2014 Standards being the last 
one developed or published. I have heard from definitive sources within the three organizations 
(AERA, APA and NCME) that without the revenue from the sale of the Joint Standards, there 
would not be funding from other sources to continue updating them. There simply would not be 
funding to continue the updates without these needed sales revenues. 

53. Given the changes happening concomitantly in testing, the testing profession, 
and society generally, the cessation of updates to the Joint Standards would be a travesty. The 
profession relies on the Joint Standards increasingly in a changing world where high stakes 
decisions are often buttressed by information gleaned from tests and measures. Indeed, the Joint 
Standards are needed. The Joint Standards are critically important to professionals who work 
with tests and measures in education and psychology. 

54. Public Resource’s justification for posting the 1999 Standards to the Internet: It is 
Public Resource’s view that, once the 1999 Standards were incorporated by reference into 
federal and/or state regulations, the 1999 Standards lost their copyright protection. As a 
consequence, Public Resource believes that it and others can freely reproduce the 1999 Standards 
(in this case, in electronic format), and post the document to the Internet so that it is freely 
available to everyone. See Exh. H. 

55. The past and continued harm that electronically reproducing and posting the 1999 
Standards to the Internet will cause AERA, APA and NCME: The three associations that are the 
Plaintiffs in this case (AERA, APA, and NCME) are integrally involved in the revision and 
publishing of the Joint Standards. I have discussed the prospects of the continuation of the Joint 
Standards with knowledgeable representatives of all three organizations. I fear that the 2014 
Standards will be the final version should AERA’s, APA’s, and NCME’s copyright infringement 
claims against Public Resource not succeed. The loss of income caused by the document being 
made freely available has already had a significant and negative impact. The loss of sales 
revenue negatively affected the three associations' budget for the development of the 2014 
Standards, and cost the associations some credibility for seeming to permit an organization such 
as Public Resource to violate copyrights that the testing profession considers so sacred (because 
tests too are copyrighted). 

56. Almost certainly, none of the three associations would be willing or able to 
finance the continuation of revisions to the Joint Standards if they are made freely available. 
The Plaintiff associations would not be financially able to continue the re-development process 
into the future. None of the associations would even have the inclination of their governance or 
membership to carry on with publishing the Joint Standards, given the additional burden this 
would place on membership costs. 

57. The past and continued harm that electronically reproducing and posting the 1999 
Test Standards to the Internet will cause to the testing professions and the public: The primary 
consequences of not revising the Joint Standards would be twofold: to the public, who are 
impacted by changes in testing practices, and to test users and their clients. Because of society’s 
reliance on test results, a significant portion of the population is benefitted by proper testing 
practices (e.g., employers select the best and most appropriate job candidates, colleges and 
universities choose the applicants most likely to succeed in their programs, students receive 
credit for their learning, programs can assess their success). Changes to testing practices are 
necessitated when societal norms and technology change. Recent editions of the Joint 
Standards have included chapters on the testing of ethnic minorities, individuals with disabilities, 
and fairness, for example. The Joint Standards represent something of a gold standard to which 
test developers and users aspire. The Joint Standards also have been modified as the needs for 
various kinds of measurement have changed. Should the practice of posting the Joint Standards 
to the Internet continue, it is likely that there will be no formally sanctioned process for their 
continuation. The agency that I run is essentially one for consumer protection. Without the Joint 
Standards, I fear that many customers (clinical psychologists, counseling psychologists, 
industrial psychologists, school psychologists, test developers, psychometricians, and the 
organizations that each works in) will be the ones losing. 

Conclusions 

58. The Joint Standards represent the single best and most complete statement of how 
tests and other measurements should be developed, used, evaluated, and interpreted. 
The Joint Standards have a long development history (for the social sciences). It is a history that 
is currently endangered by what I consider copyright theft. We in academe and the scholarly 
professions often report that all we have is our ideas and our writing. If someone is able to steal 
our ideas so openly and callously, our professions and indeed our society suffer. 


Case 1:14-cv-00857-TSC Document 67-5 Filed 01/21/16 Page 22 of 24 
USCA Case #17-7035 Document #1715850 Filed: 01/31/2018 Page 464 of 517 

59. AERA, APA, and NCME have engaged in a laborious and time consuming 
project (actually a history of projects) that they expected to be rewarded with resultant modest 
revenues. The vast portion of these revenues has gone to funding the process for the continual 
revision of the Joint Standards. Without such a revenue stream, the Joint Standards may end 
with the current edition. 

60. If there is not a next edition of the Joint Standards in the 2020s, then needed 
changes in professional practice would not be acknowledged in as formal and yet aspirational a 
manner as permitted by the Joint Standards. While the Joint Standards are not published to 
make money per se, there is an expectation of modest revenues. The Joint Standards are 
debated, considered, and written to improve practice. Funding is needed to continue this effort. 

61. Extremely well qualified members of our professions are willing to volunteer 
their time to serve on the Joint Committee to work on the Joint Standards. They do so for two 
primary reasons: i) to improve professional practice in their area of expertise, and ii) to benefit 
their professional associations. If the latter goal is removed due to lost revenues to the 
professional associations, then the quality of those willing to serve is likely to be reduced. 
People may only engage in this work if they are compensated, which again would be extremely 
difficult given the lack of or severe reduction in revenues. 

62. The current Joint Standards were published in 2014. Yet the version that Public 
Resource placed on the Internet for free access was the 1999 version of the Joint Standards. 
Unsuspecting people (e.g., students) may well access these freely available standards and believe 
that they are the “current” Joint Standards. Moreover, given the high stakes nature of our 
society presently, suppose a small test developer (and there are many of them) accessed the 
outdated Joint Standards and used them to develop, use, and/or interpret the results of a test. It 
is possible that their decisions and actions would be out of date. Further, given that the Joint 
Standards have been quoted as having been given great deference by the courts, it is possible 
that should a situation like the above occur, a test developer or test user could be placed in a 
situation whereby the advice that they follow is outdated, perhaps inappropriate advice. They 
could even be subjected to legal liability for such a well-intended, if somewhat naive, action. 

63. I expect that when one successively publishes a document and revisions to that 
document, one has a reasonable expectation that the publisher does so on the basis of exclusivity. 
No one else can publish that same document. The Mental Measurement Yearbooks, which the 
Buros Center for Testing that I direct publishes, lost considerable money in its first four or five 
editions. We are now publishing the 20th edition, and we have the expectation to earn revenues 
with each successive edition. I would hope that someone else would be prohibited from 
simply copying our books and distributing them for free. The well-earned reputation of solid and 
professional work should be rewarded with an expectation of a fair return. Giving an item away 
for free, such as the Joint Standards, violates that very principle. If the revenue loss is serious 
enough, it may remove the desire and expectation for the sponsoring organizations (AERA, APA 
and NCME) to continue publishing the document. That is exactly the situation in which these 
three professional organizations find themselves. 

Engagement and Compensation 

64. A Letter of Agreement engaging my services in this action and stating the 
compensation that I will be paid for my study and testimony in the case is attached to this Report 
as Exhibit I. 

I DECLARE, under the penalty of perjury, that the foregoing is true and correct. 

Kurt F. Geisinger, Ph.D. 

List of Materials Considered 

1. AERA, APA and NCME’s Complaint - 5/23/14; 

2. Exhibits A and B to AERA, APA and NCME’s Complaint - the Copyright Registrations 
issued for the 1999 Standards; 

3. The Counterclaim and Answer to the Complaint of Defendant, Public.Resource.Org, Inc. 
(“Public Resource”) - 7/14/14; 

4. AERA, APA and NCME’s Reply to Public Resource’s Counterclaim - 8/21/14; 

5. AERA, APA and NCME’s Amended Disclosures - 05/18/15; 

6. Public Resource’s Initial Disclosures - 05/18/15; 

7. Public Resource’s Amended Interrogatory Answers (1st Set) - 12/15/14; 

8. Public Resource’s Admissions’ Responses - 11/3/14; 

9. Public Resource’s Interrogatory Answers (2nd Set) - 3/2/15; 

10. Public Resource’s Amended Answer to Interrogatory No. 8 - 6/4/15; 

11. AERA, APA and NCME’s Interrogatory Answers - 1/20/15; 

12. AERA, APA and NCME’s Admissions’ Responses - 1/20/15; 

13. The transcript and exhibits from the deposition of the Internet Archive (by Christopher 
Butler) taken on December 2, 2014; 

14. The transcript and exhibits from the deposition of the Public Resource (by Carl 
Malamud) taken on May 12, 2015; 

15. Conversations with Felice J. Levine, Ph. D., Executive Director of AERA. It is my 
understanding that AERA is the publisher of the 1999 and 2014 Standards; 

16. Daniel R. Eignor, The Standards for Educational and Psychological Testing, APA 
Handbook of Testing and Assessment in Psychology, Vol. 1 at 245-250, (K. F. 
Geisinger ed., American Psychological Association, 2013); 

17. Susan E. Embretson & Steven P. Reise, Psychometric Methods: Item Response 
Theory for Psychologists (Lawrence Erlbaum Associates, Inc. 2000); 

18. Frederic M. Lord, Applications of Item Response Theory to Practical Testing 
Problems (Lawrence Erlbaum Associates, Inc. 1980); 

19. Roderick P. McDonald, Test theory: A Unified Treatment (Taylor & Francis 
1999); 

20. Jum C. Nunnally, Psychometric Theory (2d ed., McGraw-Hill 1978); 

21. Robert L. Thorndike, Applied psychometrics (Houghton Mifflin Co. 1982). 

Sales Report, 1999 Edition 

Period              Notes     No. of Units
FY 1999             est.      1,788
FY 2000             est.      3,797
FY 2001             est.      3,755
FY 2002             est.      5,592
FY 2003             est.      3,310
FY 2004             est.      3,218
FY 2005             Actual    3,803
FY 2006             Actual    3,888
7/1/06-12/31/06     Actual    2,144
FY 2007             Actual    3,077
FY 2008             Actual    3,358
FY 2009             Actual    2,590
FY 2010             Actual    3,043
FY 2011             Actual    2,132
FY 2012             Actual    1,649
FY 2013             Actual    1,732
FY 2014             Actual    855
Total Units Sold              49,710

Note: Estimates are based on revenue earned and reported. 

APA Membership Statistics 

Year              Associates  Members     Fellows     Total
2014              7,866       62,924      4,449       79,796
2013              8,350       69,248      4,555       82,153
2012              8,535       70,054      4,491       83,080
2011              8,593       71,247      4,499       84,339
2010              9,223       77,508      4,626       91,306
2009              8,775       78,618      4,626       92,019
2008              8,318       79,152      4,852       92,322
2007              7,943       79,407      4,705       92,055
2006              7,385       79,158      4,653       91,196
2005              7,056       78,542      4,658       90,256
2004              7,144       78,416      4,642       90,202
2003              7,240       77,938      4,597       89,775
2002              7,507       77,316      4,580       89,403
2001 [30,31,32]   7,618       76,660      4,547       88,825
2000              6,732       71,847      4,517       83,096
1999              7,068       72,064      4,484       83,617
1998              7,165       71,364      4,409       82,938
1997              7,450       70,587      4,350       82,387
1996              7,841       69,335      4,355       81,531
1995              7,719       67,063      4,316       79,098
1994              7,532       64,234      4,242       76,008
1993              7,295       61,806      4,162       73,263
1992              7,631       60,892      4,121       72,644
1991              7,884       60,259      4,059       72,202
1990              7,903       58,311      4,052       70,266
1989              8,098       56,226      3,997       68,321
1988              8,347       54,644      4,005       66,996
1986              8,587       50,727      3,832       63,146
1985              8,511       47,901      3,719       60,131
1984              8,539       46,042      3,641       58,222
1983              8,600       44,212      3,590       56,402
1982              8,681       42,071      3,528       54,282
1981              8,706       40,301      3,433       52,440
1980              8,865       38,675      3,393       50,933
1979              8,909       36,804      3,333       49,047
1978              8,817       34,832      3,242       46,891
1977              8,658       32,797      3,195       44,650
1976              8,278       30,576      3,174       42,028
1975              7,795       28,552      3,064       39,411
1974              7,357       26,644      2,999       37,000
1973              7,052       25,243      2,959       35,254
1972              6,832       23,870      2,927       33,629
1971              6,611       22,526      2,848       31,985
1970              6,532       21,502      2,805       30,839
1969              6,070       19,909      2,806       28,785
1968              5,640       18,889      2,721       27,250
1967              5,219       17,955      2,626       25,800
1966              4,812       17,095      2,566       24,473
1965              4,362       16,664      2,535       23,561
1964              3,791       15,865      2,463       22,119
1963              3,213       15,342      2,378       20,933
1962              2,623       14,931      2,337       19,891
1961              2,033       14,640      2,275       18,948
1960              1,408       14,569      2,238       18,215
1959              744         14,485      2,219       17,448
1958              none        14,474      2,170       16,644
1957              13,457                  2,088       15,545
1956              12,503                  2,006       14,509
1955              11,579                  1,896       13,475
1954 [29]         10,567                  1,813       12,380
1953              9,233                   1,690       10,903
1951              6,979                   1,576       8,554
1950              5,775                   1,498       7,272
1949              5,299                   1,436       6,735
1948              4,493                   1,261       5,754
1947              3,583                   1,078       4,661
1946 [28]         3,344                   1,083       4,427
1945              3,161       1,012                   4,173
1944              2,948       858                     3,806
1943              2,716       760                     3,231
1942              2,518       713                     3,231
1941              2,254       683                     2,937
1940              2,075       664                     2,739
1939 [27]         1,909       618                     2,527
1938 [26]         1,715       603                     2,318
1937              1,551       587                     2,138
1936              1,431       556                     1,987
1935              1,276       542                     1,818
1934              1,224       530                     1,754
1933              1,135       535                     1,670
1932              985         525                     1,510
1931              737         530                     1,267
1930              571         530                     1,101
1929              353         540                     893
1928 [25]         165         534                     699
1927 [23]         92          516                     608 [24]
1926              41          494                     535
1925 [22]                     471                     471
1924 [20]                     464                     464 [21]
1923 [17]                     457 [18]                457 [19]
1922                          442                     442
1921 [14, 15]                 424                     424 [16]
1920 [13]                     393                     393
1919                          372                     372
                              336                     336
1916 [11]                     308                     308 [12]
1915 [10]                     291                     291
1914                          285                     285
1913                          271                     271
1912                          262                     262
1911 [9]                      244                     244
1910                          228                     228
1909                          225                     225
1908                          209                     209
1907                          209                     209
1906 [8]                      190                     190
1905                          168                     -168
1904                          94                      -151
1903                          135                     135
1902                          127                     127
1901                          127                     127
1900                          127                     127
1899                          113                     113
1898                          111                     111
1897 [7]                      87                      87
1896 [6]                      94                      94
1895                          78                      78
1894 [5]                      67                      67
1893                          54 [3]                  54
1892                          31 [1]                  31
1892                          42 [2]                  42 [4]


Footnotes 

1 Preliminary Meeting. 

2 First Annual Meeting. 

3 Figures in parentheses are estimates. 

4 The first mention of membership appears in a tentative ad interim constitution adopted at the first annual meeting (1892) which 
reads: "The right of nomination for membership is reserved to the Council, the election to be made by the Association." 
(Fernberger, 1932, p. 7-8). 



http://www.apa.org/about/apa/archives/membership.aspx [1/20/2016 1:26:25 PM] 

Case 1:14-cv-00857-TSC Document 67-10 Filed 01/21/16 Page 6 of 10 

5 In the first [illegible] no specific [illegible] Article 
II, which provides for a council of six members with the president ex-officio, we find as one of its duties that they "shall nominate 
new members" and also that "the resolutions of the Council shall be brought before the Association and decided by a majority 
vote." (Fernberger, 1932, p. 8). 

6 As early as 1896, one finds that (Lightner) Witmer proposed that "all names nominated by the Council, shall be presented to the 
Association at its opening meeting in written form or visibly displayed upon a blackboard, together with a statement of the 
contribution or contributions to psychology, in virtue of which the persons named are eligible to Membership, and that the final 
action upon such names shall be taken by the Association at the final business meeting." (Fernberger, 1932, p. 8). 

7 Perhaps because of (Lightner) Witmer's motion the previous year, it was voted in 1897 "that nomination blanks be provided by the 
Secretary with spaces for the name, official position and publications of the candidate and the names of two proposers, members of 
the Association; such blanks to be filled in and sent to the Secretary before the meeting and to be read before the Association 
when the name of such candidate comes up for election." (Fernberger, 1932, p. 8). 

8 Council decided in the future to define the qualifications and make them more difficult. This was accomplished in 1906 by a formal 
announcement of the council to the association of the principles which guided them in nominating or declining to nominate 
individuals proposed for membership. "The Constitution reads that those are eligible for membership who are engaged in 'the 
advancement of Psychology as Science.' In interpreting the Constitution the Council has, historically and consistently, recognized 
two sorts of qualifications for membership: professional occupation in psychology and research. The Council now adheres to a 
somewhat strict interpretation of the former of these qualifications so that, in the absence of research, positions held in related 
branches such as philosophy and education, or temporary positions, such as assistantships in psychology, are not regarded as 
qualifying candidates for membership." (Fernberger, 1932, p. 9). 

9 "The Council having for some years back experienced frequent difficulty in securing adequate information regarding applicants for 
membership in the Association made public the following announcement: The Council requests that all recommendations for 
membership in the Association submitted to the Secretary at least one month in advance of the time of election, and that these 
recommendations be accompanied by Statement of the candidate's professional position and by copies of published researches." 
(Fernberger, 1932, p. 9). 

10 In 1915, at the end of this low period, (Charles) Judd questioned the council's interpretation of a statement regarding 
requirements of candidates for admission to membership in the association and moved that it be the sense of the association that 
the statement appended to Article I of the Constitution defining 'temporary positions' should be interpreted to include under this 
head the position of instructor." The motion was carried and we see, for the first time, the association as a whole, rather than the 
council, initiating a definition of qualifications for membership. This motion defines an instructorship as a temporary position and 
hence, for a younger man, throws still greater emphasis on the question of publication. (Fernberger, 1932, p. 10). 

11 In the next year (1916) the council again initiates a move for greater standardization as follows: "A proposal for membership, 
signed by at least two members of the Association, must be submitted to the Secretary, for the Council at least one month in 
advance of the annual meeting. The proposal must be accompanied by (1) a statement of the candidate's professional position and 
degrees, naming the institutions by which and the dates when, conferred, and (2) by copies of his published researches. In the 
absence of acceptable publications of a psychological character, or a permanent position in psychology, the conditions of 
membership will not be regarded as having been fulfilled." This announcement merely still further defined Judd's motion of the year 
before and for the first time specifically mentions academic degrees. (Fernberger, 1932, p. 10). 

12 In the same year (1916) the council also announced that "Proposals to membership that are unfavorably acted upon by the 
Council must be renewed for action at a subsequent meeting." (Fernberger, 1933, p. 10). 

13 In this year (1920) it was voted "that a committee of three, including the Secretary, be appointed by the President to revise the 
requirements for membership and to report at the next annual meeting of the Association." Boring was appointed chairman with 
Dunlap and Terman as the committee. It was also proposed and voted that this be referred to the new committee, that foreign 
members be not elected to active membership but "that distinguished psychologists in foreign countries be elected, upon 
recommendation of the Council, corresponding members of the Association and that such corresponding members be not subject 
to the payment of dues." (Fernberger, 1932, p. 11). 

14 In 1921 this committee reported and the report was adopted by the association in part only. The committee recommended two 
grades of membership, members and fellows. The recommendation was for the creation of 100 fellows within the membership of 
the association and asked for a new committee to consider the mode of election of these fellows, their qualifications, functions, etc. 
(Fernberger, 1932, p. 11). 

15 But the first part of the report, which was adopted and became law, more fully and clearly defines qualifications for membership. 
with the object of the Association, 'the advancement of psychology as a science' as stated in the Constitution; and they believe 
that this end will be most readily secured by placing emphasis upon scientific publication. They believe further that the time has 
come to abandon professional position or title as a basis for election on account of the reason that the multiplication of special 
positions, especially in nonacademic fields of psychology, makes the interpretation of the significance of position impracticable." In 
order to enforce this point of view, the Association adopted the Committee's specific recommendations for qualifications for 
members the establishment of an 'associate' grade of membership and to report to the 1924 meeting with recommendations." 
(Fernberger, 1932, p. 11-12). 

16 The Association adopted the committee's specific recommendations for qualifications for membership which were "(1) acceptable 
published research of a psychological character and (2) of the degree of the Doctor of Philosophy, based in part on a psychological 
dissertation." The question of the degree may be waived by the council in special cases providing it states its reasons when making 
the nomination. And further "(3) it is also expected that the Council shall assure itself that the nominee is actively engaged in 
psychological work at the time of the nomination." (Fernberger, 1932, p. 12). 

17 1924: At the meeting the year before it was decided that nominations must be made "not later than March 15 of the year in which 
the nomination is to be first acted upon." (Fernberger, 1932, p. 12). 

18 1923: the Council shall have power to defer action upon such proposals for membership as it deems necessary providing, 
however, that the third annual meeting after the original receipt of the nomination papers, it must decide either to present or not to 
present the candidate's name to the Association. A proposal for membership cannot be reviewed until two years have elapsed after 
the Council's action upon it." (Fernberger, 1932, p. 12). 

19 1923: It was voted that a committee of three be appointed "to consider the advisability of the establishment of an 'associate' 
grade of membership and to report to the 1924 meeting with recommendations." Boring was appointed chairman of this committee 
with F. L. Wells and Hunter. The report, which was a lengthy one, was presented in 1924 and printed in the Proceedings. The 
committee "are unanimous in the opinion that the purposes of the Association will be served by the creation of a class of 
Associates " because the growth of psychology has "created distinct groups of persons engaged in psychological work of a 
scientific character at less advanced levels" so that the fundamental requirements of membership can no longer be met by this 
group. Hence the Committee proposes a class of Associates eligible under the following qualifications: "(1) any person devoting full 
time to work that is primarily psychological; (2) any person with the degree of Doctor of Philosophy, based in part upon a 
psychological dissertation and conferred by a graduate school of recognized standing, or (3) scientists, educators or distinguished 
persons, whom the Council may recommend for sufficient reason." (Fernberger, 1932, p. 12). 

20 The exclusionary tendency that predominated the first two decades of the 20th century was to eliminate from membership 
individuals who were not directly involved in psychological pursuits. The Definition of Psychology officially hinged on the 
terminology of the association's constitution as "The Advancement of Psychology as a Science," which was primarily that of 
academic psychology involved in research, primarily experimental research. In general, it was the individuals on the periphery of 
psychology who were eliminated, those with a non professional, amateur's interest in the field, and those primarily involved in 
philosophy. (Evans, 1992, p. 78). 

21 The committee then further recommends certain methods of application of the change. The application for associateship may be 
made by the candidate rather than by two proposers as for membership. But two endorsers must be specified by the applicant with 
whom the council may (and always did) communicate. The application must be received by October 1 instead of March 15 as for 
Members. The council is to consider all applications for associateship and recommend to the association which elects. The 
associates to have the right of the floor at the annual meetings and the right to participate in the programs but are not entitled to hold 
office or to vote. Upon the recommendation of the council and by the majority vote of the annual meeting an associateship may be 
terminated. (Fernberger, 1932, p. 13). 

22 The necessary by-laws and constitutional changes were passed for the first time in 1924 and received the necessary second 
passage in 1925. Immediately and at the same meeting these changes the by-laws became effective by the election of forty-five 
associates. (Fernberger, 1932, p. 13). 

23 The committee suggests a form by means of which associates may apply for membership. This is to be accomplished by having 
all associates asked each year if they care to make application for membership. The committee also suggested a similar form of 
application blank for both grades. The changes were passed in 1927 on its second reading. This change had the effect of still 
further raising the qualifications for Membership by defining a policy of the council demanding at least two publications beyond the 
doctorate thesis. It makes the date of application for both grades uniform with a closing on March 15th. (Fernberger, 1932, p. 14). 

24 The council in 1927 were willing to recommend only a relatively few associates for membership inasmuch as they were not 
willing to construe graduate work as "devoting full time to professional work in psychology." Hence in this year a change was made 
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recognized graduate school or who at the time of application are devoting full time to professional or graduate work in psychology." 
(Fernberger, 1932, p. 14). 

25 In 1928 a new mechanism for handling nominations was approved by the council. According to this new method, which is still in 
practice, the Secretary first reviews each nomination. For those cases where there is no question that the candidate is eligible for 
associateship but not for membership (and this includes the great majority of the cases) the secretary himself approves the 
nomination and writes to so inform the candidate, telling him that if he objects to this ruling and insists upon being considered for 
membership, that his case will be presented to the council. For all other cases, those who seem to be eligible for membership and 
those whom the secretary considers are not qualified for associateship, the former method of submitting transcripts for the 
consideration of the council is followed. (Fernberger, 1932, p. 15). 

26 1) The association shall consist of three classes of persons: first, members, second, associates and third, honorary members. 2) 
Members of the association shall be persons who are primarily engaged in the advancement of psychology as a science. 3) 
Associates shall be such other persons as are interested in the advancement of Psychology as a science and who desire affiliation 
with the association for this reason. Three honorary members shall be persons, who having reached the age of seventy years and 
having been members for at least twenty years, request such status. (APA Yearbook, 1938, pgs. 14-15). 

27 The association shall consist of three classes of persons: first, members, second, associates and third, life members. Four life 
members shall be persons who, having reached the ages of seventy years and having been members of the association for at least 
twenty years, request such status. (APA Yearbook, 1939, pg. 21). 

28 The association shall consist of three classes of members: Fellows, associates and life members. Two fellows of the association 
shall be persons who are primarily engaged in the advancement of psychology as a profession. (APA Yearbook, 1946-1947, p. 26). 

29 In 1954, the council formally requested the Policy and Planning board to study the standards for membership, which, at that time, 
were those set forth in article II of the original (1946) bylaws. These classes of Membership were defined as follows: 

■ Fellow. Holder of Doctoral degree based in part on a dissertation psychological in nature, prior membership as an associate 
and acceptable, published research beyond the dissertation or four years of acceptable professional experience. The 
nomination was made by a Division to the Board of Directors, which, if approved was recommended to the council. 

■ Associate. Holder of a doctorate or completion of two years of graduate work in psychology, or completion of the year of 
graduate study and one year of professional experience; or that the individual be a distinguished person recommended by 
the board of directors. 

■ Life Member. A fellow or an associate for 25 years and attainment of age 65. 

As a result of its deliberations, the Policy and Planning board recommended to the Board of Directors that the categories be revised. 
After some years of debate, the Council approved three classes of membership: fellow, member and associate. On approval by the 
membership, this change went into effect at the beginning of 1958. Standards for Fellow were strengthened by requiring the 
nominating division to furnish the Membership Committee with clear evidence of the candidate's unusual or outstanding 
accomplishment in Psychology. The new category of member required the doctorate, thus preserving the time-honored criterion. 
The class of associate was continued for subdoctoral psychologists, but it was stipulated that when an associate was awarded the 
doctorate, he or she would automatically be raised to member. The life member category was dropped, but waiver of dues, when 
requested, for members over 65 years of age and with 25 years of membership was retained. Various types of affiliates, such as 
student, division and foreign were recognized, but, as in 1945, they were not counted as members of the association. (Evans, 
1992, p. 182-183). 

30 Member: The minimum standard for election to member status is receipt of the doctoral degree based in part on a psychological 
dissertation or based on other evidence of proficiency in psychological scholarship. The doctoral degree must be received from a 
program primarily psychological in content and must be conferred by a graduate or professional school that (a) is regionally 
accredited or (b) has achieved such accreditation within five years of the year the doctorate was granted, or (c) is a school of 
equivalent standing outside of the United States. All members may vote and hold office in the association. (Directory, 2001, p. IX). 

31 Associate Member: To become an associate member, an applicant must meet one of two sets of requirements: (a) must have 
completed two years of graduate work in psychology at a regionally accredited graduate or professional school or (b) must have 
received the master's degree in psychology from a regionally accredited graduate or professional school. Associate members 
initially may not vote or hold office in APA. After five consecutive years of membership, associate members may vote. (Directory, 
2001, p. IX). 
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fellows of the APA. Candidates for fellows status must previously have been members for at least one full year, have a doctoral 
degree in psychology and at least five years of acceptable experience beyond that degree, hold membership in the nominating 
division, and present evidence of unusual or outstanding contribution or performance in the field of psychology. Fellows may vote 
and hold office. (Directory, 2001, p. IX). 
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United States Capitol 
Washington, D.C. 


9:10 P.M. EST 

THE PRESIDENT: Mr. Speaker, Mr. Vice President, members of Congress, distinguished 
guests, and fellow Americans: 

Last month, I went to Andrews Air Force Base and welcomed home some of our last troops to 
serve in Iraq. Together, we offered a final, proud salute to the colors under which more than a 
million of our fellow citizens fought -- and several thousand gave their lives. 

We gather tonight knowing that this generation of heroes has made the United States safer 
and more respected around the world. (Applause.) For the first time in nine years, there are 
no Americans fighting in Iraq. (Applause.) For the first time in two decades, Osama bin 
Laden is not a threat to this country. (Applause.) Most of al Qaeda’s top lieutenants have 
been defeated. The Taliban’s momentum has been broken, and some troops in Afghanistan 
have begun to come home. 

These achievements are a testament to the courage, selflessness and teamwork of America’s 
Armed Forces. At a time when too many of our institutions have let us down, they exceed all 
expectations. They’re not consumed with personal ambition. They don’t obsess over their 
differences. They focus on the mission at hand. They work together. 

Imagine what we could accomplish if we followed their example. (Applause.) Think about the 
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unstable parts of the world. An economy built to last, where hard work pays off, and 
responsibility is rewarded. 

We can do this. I know we can, because we’ve done it before. At the end of World War II, 
when another generation of heroes returned home from combat, they built the strongest 
economy and middle class the world has ever known. (Applause.) My grandfather, a veteran 
of Patton’s Army, got the chance to go to college on the GI Bill. My grandmother, who worked 
on a bomber assembly line, was part of a workforce that turned out the best products on 
Earth. 

The two of them shared the optimism of a nation that had triumphed over a depression and 
fascism. They understood they were part of something larger; that they were contributing to a 
story of success that every American had a chance to share - the basic American promise 
that if you worked hard, you could do well enough to raise a family, own a home, send your 
kids to college, and put a little away for retirement. 

The defining issue of our time is how to keep that promise alive. No challenge is more 
urgent. No debate is more important. We can either settle for a country where a shrinking 
number of people do really well while a growing number of Americans barely get by, or we can 
restore an economy where everyone gets a fair shot, and everyone does their fair share, and 
everyone plays by the same set of rules. (Applause.) What’s at stake aren’t Democratic 
values or Republican values, but American values. And we have to reclaim them. 

Let’s remember how we got here. Long before the recession, jobs and manufacturing began 
leaving our shores. Technology made businesses more efficient, but also made some jobs 
obsolete. Folks at the top saw their incomes rise like never before, but most hardworking 
Americans struggled with costs that were growing, paychecks that weren’t, and personal debt 
that kept piling up. 

In 2008, the house of cards collapsed. We learned that mortgages had been sold to people 
who couldn’t afford or understand them. Banks had made huge bets and bonuses with other 
people’s money. Regulators had looked the other way, or didn’t have the authority to stop the 
bad behavior. 

It was wrong. It was irresponsible. And it plunged our economy into a crisis that put millions 
out of work, saddled us with more debt, and left innocent, hardworking Americans holding the 
bag. In the six months before I took office, we lost nearly 4 million jobs. And we lost another 
4 million before our policies were in full effect. 
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than 3 million jobs. (Applause.) 

Last year, they created the most jobs since 2005. American manufacturers are hiring again, 
creating jobs for the first time since the late 1990s. Together, we’ve agreed to cut the deficit 
by more than $2 trillion. And we’ve put in place new rules to hold Wall Street accountable, so 
a crisis like this never happens again. (Applause.) 

The state of our Union is getting stronger. And we’ve come too far to turn back now. As long 
as I’m President, I will work with anyone in this chamber to build on this momentum. But I 
intend to fight obstruction with action, and I will oppose any effort to return to the very same 
policies that brought on this economic crisis in the first place. (Applause.) 

No, we will not go back to an economy weakened by outsourcing, bad debt, and phony 
financial profits. Tonight, I want to speak about how we move forward, and lay out a blueprint 
for an economy that’s built to last -- an economy built on American manufacturing, American 
energy, skills for American workers, and a renewal of American values. 

Now, this blueprint begins with American manufacturing. 

On the day I took office, our auto industry was on the verge of collapse. Some even said we 
should let it die. With a million jobs at stake, I refused to let that happen. In exchange for 
help, we demanded responsibility. We got workers and automakers to settle their 
differences. We got the industry to retool and restructure. Today, General Motors is back on 
top as the world’s number-one automaker. (Applause.) Chrysler has grown faster in the U.S. 
than any major car company. Ford is investing billions in U.S. plants and factories. And 
together, the entire industry added nearly 160,000 jobs. 

We bet on American workers. We bet on American ingenuity. And tonight, the American auto 
industry is back. (Applause.) 

What’s happening in Detroit can happen in other industries. It can happen in Cleveland and 
Pittsburgh and Raleigh. We can’t bring every job back that’s left our shore. But right now, it’s 
getting more expensive to do business in places like China. Meanwhile, America is more 
productive. A few weeks ago, the CEO of Master Lock told me that it now makes business 
sense for him to bring jobs back home. (Applause.) Today, for the first time in 15 years, 
Master Lock’s unionized plant in Milwaukee is running at full capacity. (Applause.) 

So we have a huge opportunity, at this moment, to bring manufacturing back. But we have to 
seize it. Tonight, my message to business leaders is simple: Ask yourselves what you can do 
to bring jobs back to your country, and your country will do everything we can to help you 
succeed. (Applause.) 
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profits overseas. Meanwhile, companies that choose to stay in America get hit with one of the 
highest tax rates in the world. It makes no sense, and everyone knows it. So let’s change it. 

First, if you’re a business that wants to outsource jobs, you shouldn’t get a tax deduction for 
doing it. (Applause.) That money should be used to cover moving expenses for companies 
like Master Lock that decide to bring jobs home. (Applause.) 

Second, no American company should be able to avoid paying its fair share of taxes by 
moving jobs and profits overseas. (Applause.) From now on, every multinational company 
should have to pay a basic minimum tax. And every penny should go towards lowering taxes 
for companies that choose to stay here and hire here in America. (Applause.) 

Third, if you’re an American manufacturer, you should get a bigger tax cut. If you’re a 
high-tech manufacturer, we should double the tax deduction you get for making your products here. 
And if you want to relocate in a community that was hit hard when a factory left town, you 
should get help financing a new plant, equipment, or training for new workers. (Applause.) 

So my message is simple. It is time to stop rewarding businesses that ship jobs overseas, 
and start rewarding companies that create jobs right here in America. Send me these tax 
reforms, and I will sign them right away. (Applause.) 

We’re also making it easier for American businesses to sell products all over the world. Two 
years ago, I set a goal of doubling U.S. exports over five years. With the bipartisan trade 
agreements we signed into law, we’re on track to meet that goal ahead of schedule. 
(Applause.) And soon, there will be millions of new customers for American goods in 
Panama, Colombia, and South Korea. Soon, there will be new cars on the streets of Seoul 
imported from Detroit, and Toledo, and Chicago. (Applause.) 

I will go anywhere in the world to open new markets for American products. And I will not 
stand by when our competitors don’t play by the rules. We’ve brought trade cases against 
China at nearly twice the rate as the last administration -- and it’s made a difference. 
(Applause.) Over a thousand Americans are working today because we stopped a surge in 
Chinese tires. But we need to do more. It’s not right when another country lets our movies, 
music, and software be pirated. It’s not fair when foreign manufacturers have a leg up on ours 
only because they’re heavily subsidized. 

Tonight, I’m announcing the creation of a Trade Enforcement Unit that will be charged with 
investigating unfair trading practices in countries like China. (Applause.) There will be more 
inspections to prevent counterfeit or unsafe goods from crossing our borders. And this 
Congress should make sure that no foreign company has an advantage over American 
manufacturing when it comes to accessing financing or new markets like Russia. Our workers 
are the most productive on Earth, and if the playing field is level, I promise you -- America will 
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I also hear from many business leaders who want to hire in the United States but can’t find 
workers with the right skills. Growing industries in science and technology have twice as 
many openings as we have workers who can do the job. Think about that — openings at a 
time when millions of Americans are looking for work. It’s inexcusable. And we know how to 
fix it. 

Jackie Bray is a single mom from North Carolina who was laid off from her job as a mechanic. 
Then Siemens opened a gas turbine factory in Charlotte, and formed a partnership with 
Central Piedmont Community College. The company helped the college design courses in 
laser and robotics training. It paid Jackie’s tuition, then hired her to help operate their plant. 

I want every American looking for work to have the same opportunity as Jackie did. Join me 
in a national commitment to train 2 million Americans with skills that will lead directly to a job. 
(Applause.) My administration has already lined up more companies that want to help. Model 
partnerships between businesses like Siemens and community colleges in places like 
Charlotte, and Orlando, and Louisville are up and running. Now you need to give more 
community colleges the resources they need to become community career centers — places 
that teach people skills that businesses are looking for right now, from data management to 
high-tech manufacturing. 

And I want to cut through the maze of confusing training programs, so that from now on, 
people like Jackie have one program, one website, and one place to go for all the information 
and help that they need. It is time to turn our unemployment system into a reemployment 
system that puts people to work. (Applause.) 

These reforms will help people get jobs that are open today. But to prepare for the jobs of 
tomorrow, our commitment to skills and education has to start earlier. 

For less than 1 percent of what our nation spends on education each year, we’ve convinced 
nearly every state in the country to raise their standards for teaching and learning - the first 
time that’s happened in a generation. 

But challenges remain. And we know how to solve them. 

At a time when other countries are doubling down on education, tight budgets have forced 
states to lay off thousands of teachers. We know a good teacher can increase the lifetime 
income of a classroom by over $250,000. A great teacher can offer an escape from poverty to 
the child who dreams beyond his circumstance. Every person in this chamber can point to a 
teacher who changed the trajectory of their lives. Most teachers work tirelessly, with modest 
pay, sometimes digging into their own pocket for school supplies - just to make a difference. 
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a deal. Give them the resources to keep good teachers on the job, and reward the best ones. 
(Applause.) And in return, grant schools flexibility: to teach with creativity and passion; to 
stop teaching to the test; and to replace teachers who just aren’t helping kids learn. That’s a 
bargain worth making. (Applause.) 

We also know that when students don’t walk away from their education, more of them walk 
the stage to get their diploma. When students are not allowed to drop out, they do better. So 
tonight, I am proposing that every state - every state - requires that all students stay in high 
school until they graduate or turn 18. (Applause.) 

When kids do graduate, the most daunting challenge can be the cost of college. At a time 
when Americans owe more in tuition debt than credit card debt, this Congress needs to stop 
the interest rates on student loans from doubling in July. (Applause.) 

Extend the tuition tax credit we started that saves millions of middle-class families thousands 
of dollars, and give more young people the chance to earn their way through college by 
doubling the number of work-study jobs in the next five years. (Applause.) 

Of course, it’s not enough for us to increase student aid. We can’t just keep subsidizing 
skyrocketing tuition; we’ll run out of money. States also need to do their part, by making 
higher education a higher priority in their budgets. And colleges and universities have to do 
their part by working to keep costs down. 

Recently, I spoke with a group of college presidents who’ve done just that. Some schools 
redesign courses to help students finish more quickly. Some use better technology. The 
point is, it’s possible. So let me put colleges and universities on notice: If you can’t stop 
tuition from going up, the funding you get from taxpayers will go down. (Applause.) Higher 
education can’t be a luxury — it is an economic imperative that every family in America should 
be able to afford. 

Let’s also remember that hundreds of thousands of talented, hardworking students in this 
country face another challenge: the fact that they aren’t yet American citizens. Many were 
brought here as small children, are American through and through, yet they live every day with 
the threat of deportation. Others came more recently, to study business and science and 
engineering, but as soon as they get their degree, we send them home to invent new products 
and create new jobs somewhere else. 

That doesn’t make sense. 

I believe as strongly as ever that we should take on illegal immigration. That’s why my 
administration has put more boots on the border than ever before. That’s why there are fewer 
illegal crossings than when I took office. The opponents of action are out of excuses. We 
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But if election-year politics keeps Congress from acting on a comprehensive plan, let’s at least 
agree to stop expelling responsible young people who want to staff our labs, start new 
businesses, defend this country. Send me a law that gives them the chance to earn their 
citizenship. I will sign it right away. (Applause.) 

You see, an economy built to last is one where we encourage the talent and ingenuity of every 
person in this country. That means women should earn equal pay for equal work. 
(Applause.) It means we should support everyone who’s willing to work, and every risk-taker 
and entrepreneur who aspires to become the next Steve Jobs. 

After all, innovation is what America has always been about. Most new jobs are created in 
start-ups and small businesses. So let’s pass an agenda that helps them succeed. Tear 
down regulations that prevent aspiring entrepreneurs from getting the financing to grow. 
(Applause.) Expand tax relief to small businesses that are raising wages and creating good 
jobs. Both parties agree on these ideas. So put them in a bill, and get it on my desk this 
year. (Applause.) 

Innovation also demands basic research. Today, the discoveries taking place in our federally 
financed labs and universities could lead to new treatments that kill cancer cells but leave 
healthy ones untouched. New lightweight vests for cops and soldiers that can stop any bullet. 
Don’t gut these investments in our budget. Don’t let other countries win the race for the 
future. Support the same kind of research and innovation that led to the computer chip and 
the Internet; to new American jobs and new American industries. 

And nowhere is the promise of innovation greater than in American-made energy. Over the 
last three years, we’ve opened millions of new acres for oil and gas exploration, and tonight, 
I’m directing my administration to open more than 75 percent of our potential offshore oil and 
gas resources. (Applause.) Right now - right now - American oil production is the highest 
that it’s been in eight years. That’s right - eight years. Not only that - last year, we relied 
less on foreign oil than in any of the past 16 years. (Applause.) 

But with only 2 percent of the world’s oil reserves, oil isn’t enough. This country needs an 
all-out, all-of-the-above strategy that develops every available source of American energy. 
(Applause.) A strategy that’s cleaner, cheaper, and full of new jobs. 

We have a supply of natural gas that can last America nearly 100 years. (Applause.) And my 
administration will take every possible action to safely develop this energy. Experts believe 
this will support more than 600,000 jobs by the end of the decade. And I’m requiring all 
companies that drill for gas on public lands to disclose the chemicals they use. (Applause.) 

Because America will develop this resource without putting the health and safety of our 
citizens at risk. 
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The development of natural gas will create jobs and power trucks and factories that are 
cleaner and cheaper, proving that we don’t have to choose between our environment and our 
economy. (Applause.) And by the way, it was public research dollars, over the course of 30 
years, that helped develop the technologies to extract all this natural gas out of shale rock — 
reminding us that government support is critical in helping businesses get new energy ideas 
off the ground. (Applause.) 

Now, what’s true for natural gas is just as true for clean energy. In three years, our 
partnership with the private sector has already positioned America to be the world’s leading 
manufacturer of high-tech batteries. Because of federal investments, renewable energy use 
has nearly doubled, and thousands of Americans have jobs because of it. 

When Bryan Ritterby was laid off from his job making furniture, he said he worried that at 55, 
no one would give him a second chance. But he found work at Energetx, a wind turbine 
manufacturer in Michigan. Before the recession, the factory only made luxury yachts. Today, 
it’s hiring workers like Bryan, who said, “I’m proud to be working in the industry of the future.” 

Our experience with shale gas, our experience with natural gas, shows us that the payoffs on 
these public investments don’t always come right away. Some technologies don’t pan out; 
some companies fail. But I will not walk away from the promise of clean energy. I will not 
walk away from workers like Bryan. (Applause.) I will not cede the wind or solar or battery 
industry to China or Germany because we refuse to make the same commitment here. 

We’ve subsidized oil companies for a century. That’s long enough. (Applause.) It’s time to 
end the taxpayer giveaways to an industry that rarely has been more profitable, and double 
down on a clean energy industry that never has been more promising. Pass clean energy tax 
credits. Create these jobs. (Applause.) 

We can also spur energy innovation with new incentives. The differences in this chamber 
may be too deep right now to pass a comprehensive plan to fight climate change. But there’s 
no reason why Congress shouldn’t at least set a clean energy standard that creates a market 
for innovation. So far, you haven’t acted. Well, tonight, I will. I’m directing my administration 
to allow the development of clean energy on enough public land to power 3 million homes. 

And I’m proud to announce that the Department of Defense, working with us, the world’s 
largest consumer of energy, will make one of the largest commitments to clean energy in 
history — with the Navy purchasing enough capacity to power a quarter of a million homes a 
year. (Applause.) 

Of course, the easiest way to save money is to waste less energy. So here’s a proposal: 
Help manufacturers eliminate energy waste in their factories and give businesses incentives 
to upgrade their buildings. Their energy bills will be $100 billion lower over the next decade, 
and America will have less pollution, more manufacturing, more jobs for construction workers 
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Building this new energy future should be just one part of a broader agenda to repair America’s 
infrastructure. So much of America needs to be rebuilt. We’ve got crumbling roads and 
bridges; a power grid that wastes too much energy; an incomplete high-speed broadband 
network that prevents a small business owner in rural America from selling her products all 
over the world. 

During the Great Depression, America built the Hoover Dam and the Golden Gate Bridge. 
After World War II, we connected our states with a system of highways. Democratic and 
Republican administrations invested in great projects that benefited everybody, from the 
workers who built them to the businesses that still use them today. 

In the next few weeks, I will sign an executive order clearing away the red tape that slows 
down too many construction projects. But you need to fund these projects. Take the money 
we’re no longer spending at war, use half of it to pay down our debt, and use the rest to do 
some nation-building right here at home. (Applause.) 

There’s never been a better time to build, especially since the construction industry was one 
of the hardest hit when the housing bubble burst. Of course, construction workers weren’t the 
only ones who were hurt. So were millions of innocent Americans who’ve seen their home 
values decline. And while government can’t fix the problem on its own, responsible 
homeowners shouldn’t have to sit and wait for the housing market to hit bottom to get some 
relief. 

And that’s why I’m sending this Congress a plan that gives every responsible homeowner the 
chance to save about $3,000 a year on their mortgage, by refinancing at historically low rates. 
(Applause.) No more red tape. No more runaround from the banks. A small fee on the 
largest financial institutions will ensure that it won’t add to the deficit and will give those banks 
that were rescued by taxpayers a chance to repay a deficit of trust. (Applause.) 

Let’s never forget: Millions of Americans who work hard and play by the rules every day 
deserve a government and a financial system that do the same. It’s time to apply the same 
rules from top to bottom. No bailouts, no handouts, and no copouts. An America built to last 
insists on responsibility from everybody. 

We’ve all paid the price for lenders who sold mortgages to people who couldn’t afford them, 
and buyers who knew they couldn’t afford them. That’s why we need smart regulations to 
prevent irresponsible behavior. (Applause.) Rules to prevent financial fraud or toxic dumping 
or faulty medical devices - these don’t destroy the free market. They make the free market 
work better. 

There’s no question that some regulations are outdated, unnecessary, or too costly. In fact, 

JA2862 

https://www.whitehouse.gov/the-press-office/2012/01/24/remarks-president-state-union-address [ 1 /20/20168:50:31 PM] 











Remarks by the President in State of the Union Address | whitehouse.gov 

Case l:14-cv-00857-TSC Document 67-12 Filed 01/21/16 Page 12 of 18 
I’ve appbtS€& fleas® tt$gb\mfas mtoixto&mtit&y&GDalf my prieteter®V30i^)ar^ Repa&6c4i®9 of 517 
predecessor did in his. (Applause.) I’ve ordered every federal agency to eliminate rules that 
don’t make sense. We’ve already announced over 500 reforms, and just a fraction of them 
will save business and citizens more than $10 billion over the next five years. We got rid of 
one rule from 40 years ago that could have forced some dairy farmers to spend $10,000 a 
year proving that they could contain a spill - because milk was somehow classified as an oil.
With a rule like that, I guess it was worth crying over spilled milk. (Laughter and applause.)

Now, I’m confident a farmer can contain a milk spill without a federal agency looking over his 
shoulder. (Applause.) Absolutely. But I will not back down from making sure an oil company 
can contain the kind of oil spill we saw in the Gulf two years ago. (Applause.) I will not back 
down from protecting our kids from mercury poisoning, or making sure that our food is safe 
and our water is clean. I will not go back to the days when health insurance companies had 
unchecked power to cancel your policy, deny your coverage, or charge women differently than 
men. (Applause.) 

And I will not go back to the days when Wall Street was allowed to play by its own set of 
rules. The new rules we passed restore what should be any financial system’s core purpose:
Getting funding to entrepreneurs with the best ideas, and getting loans to responsible families
who want to buy a home, or start a business, or send their kids to college. 

So if you are a big bank or financial institution, you’re no longer allowed to make risky bets 
with your customers’ deposits. You’re required to write out a “living will” that details exactly 
how you’ll pay the bills if you fail — because the rest of us are not bailing you out ever again. 
(Applause.) And if you’re a mortgage lender or a payday lender or a credit card company, the 
days of signing people up for products they can’t afford with confusing forms and deceptive 
practices - those days are over. Today, American consumers finally have a watchdog in 
Richard Cordray with one job: To look out for them. (Applause.) 

We’ll also establish a Financial Crimes Unit of highly trained investigators to crack down on 
large-scale fraud and protect people’s investments. Some financial firms violate major
anti-fraud laws because there’s no real penalty for being a repeat offender. That’s bad for
consumers, and it’s bad for the vast majority of bankers and financial service professionals 
who do the right thing. So pass legislation that makes the penalties for fraud count. 

And tonight, I’m asking my Attorney General to create a special unit of federal prosecutors 
and leading state attorney general to expand our investigations into the abusive lending and 
packaging of risky mortgages that led to the housing crisis. (Applause.) This new unit will 
hold accountable those who broke the law, speed assistance to homeowners, and help turn 
the page on an era of recklessness that hurt so many Americans. 

Now, a return to the American values of fair play and shared responsibility will help protect our 
[...] invest in our future.

Right now, our most immediate priority is stopping a tax hike on 160 million working 
Americans while the recovery is still fragile. (Applause.) People cannot afford losing $40 out
of each paycheck this year. There are plenty of ways to get this done. So let’s agree right
here, right now: No side issues. No drama. Pass the payroll tax cut without delay. Let’s get
it done. (Applause.)

When it comes to the deficit, we’ve already agreed to more than $2 trillion in cuts and 
savings. But we need to do more, and that means making choices. Right now, we’re poised 
to spend nearly $1 trillion more on what was supposed to be a temporary tax break for the 
wealthiest 2 percent of Americans. Right now, because of loopholes and shelters in the tax 
code, a quarter of all millionaires pay lower tax rates than millions of middle-class 
households. Right now, Warren Buffett pays a lower tax rate than his secretary. 

Do we want to keep these tax cuts for the wealthiest Americans? Or do we want to keep our 
investments in everything else — like education and medical research; a strong military and 
care for our veterans? Because if we’re serious about paying down our debt, we can’t do 
both. 

The American people know what the right choice is. So do I. As I told the Speaker this 
summer, I’m prepared to make more reforms that rein in the long-term costs of Medicare and 
Medicaid, and strengthen Social Security, so long as those programs remain a guarantee of 
security for seniors. 

But in return, we need to change our tax code so that people like me, and an awful lot of 
members of Congress, pay our fair share of taxes. (Applause.) 

Tax reform should follow the Buffett Rule. If you make more than $1 million a year, you 
should not pay less than 30 percent in taxes. And my Republican friend Tom Coburn is right: 
Washington should stop subsidizing millionaires. In fact, if you’re earning a million dollars a 
year, you shouldn’t get special tax subsidies or deductions. On the other hand, if you make 
under $250,000 a year, like 98 percent of American families, your taxes shouldn’t go up.
(Applause.) You’re the ones struggling with rising costs and stagnant wages. You’re the ones
who need relief. 

Now, you can call this class warfare all you want. But asking a billionaire to pay at least as 
much as his secretary in taxes? Most Americans would call that common sense. 

We don’t begrudge financial success in this country. We admire it. When Americans talk 
about folks like me paying my fair share of taxes, it’s not because they envy the rich. It’s 
because they understand that when I get a tax break I don’t need and the country can’t afford, 
[...] a fixed income, or a student trying to get through school, or a family trying to make ends
meet. That’s not right. Americans know that’s not right. They know that this generation’s 
success is only possible because past generations felt a responsibility to each other, and to 
the future of their country, and they know our way of life will only endure if we feel that same 
sense of shared responsibility. That’s how we’ll reduce our deficit. That’s an America built to 
last. (Applause.) 

Now, I recognize that people watching tonight have differing views about taxes and debt, 
energy and health care. But no matter what party they belong to, I bet most Americans are 
thinking the same thing right about now: Nothing will get done in Washington this year, or 
next year, or maybe even the year after that, because Washington is broken. 

Can you blame them for feeling a little cynical? 

The greatest blow to our confidence in our economy last year didn’t come from events beyond 
our control. It came from a debate in Washington over whether the United States would pay 
its bills or not. Who benefited from that fiasco? 

I’ve talked tonight about the deficit of trust between Main Street and Wall Street. But the 
divide between this city and the rest of the country is at least as bad - and it seems to get 
worse every year. 

Some of this has to do with the corrosive influence of money in politics. So together, let’s take 
some steps to fix that. Send me a bill that bans insider trading by members of Congress; I will 
sign it tomorrow. (Applause.) Let’s limit any elected official from owning stocks in industries
they impact. Let’s make sure people who bundle campaign contributions for Congress can’t 
lobby Congress, and vice versa - an idea that has bipartisan support, at least outside of 
Washington. 

Some of what’s broken has to do with the way Congress does its business these days. A 
simple majority is no longer enough to get anything — even routine business — passed 
through the Senate. (Applause.) Neither party has been blameless in these tactics. Now 
both parties should put an end to it. (Applause.) For starters, I ask the Senate to pass a
simple rule that all judicial and public service nominations receive a simple up or down vote
within 90 days. (Applause.)

The executive branch also needs to change. Too often, it’s inefficient, outdated and remote. 
(Applause.) That’s why I’ve asked this Congress to grant me the authority to consolidate the 
federal bureaucracy, so that our government is leaner, quicker, and more responsive to the 
needs of the American people. (Applause.) 

Finally, none of this can happen unless we also lower the temperature in this town. We need
to end the notion that the two parties must be locked in a perpetual campaign of mutual
destruction; that politics is about clinging to rigid ideologies instead of building consensus 
around common-sense ideas. 

I’m a Democrat. But I believe what Republican Abraham Lincoln believed: That government 
should do for people only what they cannot do better by themselves, and no more.
(Applause.) That’s why my education reform offers more competition, and more control for
schools and states. That’s why we’re getting rid of regulations that don’t work. That’s why our 
health care law relies on a reformed private market, not a government program. 

On the other hand, even my Republican friends who complain the most about government 
spending have supported federally financed roads, and clean energy projects, and federal 
offices for the folks back home. 

The point is, we should all want a smarter, more effective government. And while we may not 
be able to bridge our biggest philosophical differences this year, we can make real progress. 

With or without this Congress, I will keep taking actions that help the economy grow. But I can 
do a whole lot more with your help. Because when we act together, there’s nothing the United 
States of America can’t achieve. (Applause.) That’s the lesson we’ve learned from our 
actions abroad over the last few years. 

Ending the Iraq war has allowed us to strike decisive blows against our enemies. From 
Pakistan to Yemen, the al Qaeda operatives who remain are scrambling, knowing that they 
can’t escape the reach of the United States of America. (Applause.) 

From this position of strength, we’ve begun to wind down the war in Afghanistan. Ten 
thousand of our troops have come home. Twenty-three thousand more will leave by the end 
of this summer. This transition to Afghan lead will continue, and we will build an enduring 
partnership with Afghanistan, so that it is never again a source of attacks against America. 
(Applause.) 

As the tide of war recedes, a wave of change has washed across the Middle East and North 
Africa, from Tunis to Cairo; from Sana’a to Tripoli. A year ago, Qaddafi was one of the world’s 
longest-serving dictators — a murderer with American blood on his hands. Today, he is 
gone. And in Syria, I have no doubt that the Assad regime will soon discover that the forces 
of change cannot be reversed, and that human dignity cannot be denied. (Applause.) 

How this incredible transformation will end remains uncertain. But we have a huge stake in
the outcome. And while it’s ultimately up to the people of the region to decide their fate, we 
will advocate for those values that have served our own country so well. We will stand against 
violence and intimidation. We will stand for the rights and dignity of all human beings — men 
and women; Christians, Muslims and Jews. We will support policies that lead to strong and 
stable democracies and open markets, because tyranny is no match for liberty. 

And we will safeguard America’s own security against those who threaten our citizens, our
friends, and our interests. Look at Iran. Through the power of our diplomacy, a world that 
was once divided about how to deal with Iran’s nuclear program now stands as one. The 
regime is more isolated than ever before; its leaders are faced with crippling sanctions, and as 
long as they shirk their responsibilities, this pressure will not relent. 

Let there be no doubt: America is determined to prevent Iran from getting a nuclear weapon, 
and I will take no options off the table to achieve that goal. (Applause.) 

But a peaceful resolution of this issue is still possible, and far better, and if Iran changes 
course and meets its obligations, it can rejoin the community of nations. 

The renewal of American leadership can be felt across the globe. Our oldest alliances in 
Europe and Asia are stronger than ever. Our ties to the Americas are deeper. Our ironclad 
commitment - and I mean ironclad - to Israel’s security has meant the closest military 
cooperation between our two countries in history. (Applause.) 

We’ve made it clear that America is a Pacific power, and a new beginning in Burma has lit a 
new hope. From the coalitions we’ve built to secure nuclear materials, to the missions we’ve
led against hunger and disease; from the blows we’ve dealt to our enemies, to the enduring 
power of our moral example, America is back. 

Anyone who tells you otherwise, anyone who tells you that America is in decline or that our 
influence has waned, doesn’t know what they’re talking about. (Applause.) 

That’s not the message we get from leaders around the world who are eager to work with us. 

That’s not how people feel from Tokyo to Berlin, from Cape Town to Rio, where opinions of 
America are higher than they’ve been in years. Yes, the world is changing. No, we can’t 
control every event. But America remains the one indispensable nation in world affairs — and 
as long as I’m President, I intend to keep it that way. (Applause.) 

That’s why, working with our military leaders, I’ve proposed a new defense strategy that 
ensures we maintain the finest military in the world, while saving nearly half a trillion dollars in 
our budget. To stay one step ahead of our adversaries, I’ve already sent this Congress 
legislation that will secure our country from the growing dangers of cyber-threats. (Applause.) 

Above all, our freedom endures because of the men and women in uniform who defend it. 
(Applause.) As they come home, we must serve them as well as they’ve served us. That 
includes giving them the care and the benefits they have earned — which is why we’ve 
increased annual VA spending every year I’ve been President. (Applause.) And it means 
enlisting our veterans in the work of rebuilding our nation. 

With the bipartisan support of this Congress, we’re providing new tax credits to companies that 
[...] of 135,000 jobs for veterans and their families. And tonight, I’m proposing a Veterans Jobs
Corps that will help our communities hire veterans as cops and firefighters, so that America is 
as strong as those who defend her. (Applause.) 

Which brings me back to where I began. Those of us who’ve been sent here to serve can 
learn a thing or two from the service of our troops. When you put on that uniform, it doesn’t
matter if you’re black or white; Asian, Latino, Native American; conservative, liberal; rich, poor;
gay, straight. When you’re marching into battle, you look out for the person next to you, or the
mission fails. When you’re in the thick of the fight, you rise or fall as one unit, serving one 
nation, leaving no one behind. 

One of my proudest possessions is the flag that the SEAL Team took with them on the mission 
to get bin Laden. On it are each of their names. Some may be Democrats. Some may be 
Republicans. But that doesn’t matter. Just like it didn’t matter that day in the Situation Room, 
when I sat next to Bob Gates - a man who was George Bush’s defense secretary - and 
Hillary Clinton - a woman who ran against me for president. 

All that mattered that day was the mission. No one thought about politics. No one thought 
about themselves. One of the young men involved in the raid later told me that he didn’t 
deserve credit for the mission. It only succeeded, he said, because every single member of 
that unit did their job - the pilot who landed the helicopter that spun out of control; the 
translator who kept others from entering the compound; the troops who separated the women 
and children from the fight; the SEALs who charged up the stairs. More than that, the mission 
only succeeded because every member of that unit trusted each other - because you can’t 
charge up those stairs, into darkness and danger, unless you know that there’s somebody 
behind you, watching your back. 

So it is with America. Each time I look at that flag, I’m reminded that our destiny is stitched 
together like those 50 stars and those 13 stripes. No one built this country on their own. This 
nation is great because we built it together. This nation is great because we worked as a 
team. This nation is great because we get each other’s backs. And if we hold fast to that 
truth, in this moment of trial, there is no challenge too great; no mission too hard. As long as 
we are joined in common purpose, as long as we maintain our common resolve, our journey 
moves forward, and our future is hopeful, and the state of our Union will always be strong. 

Thank you, God bless you, and God bless the United States of America. (Applause.) 

END 

10:16 P.M. EST 

Summary: We’ve put together a quick primer to help you understand the details
behind President Obama's announcement that 10 states will receive waivers
exempting them from meeting No Child Left Behind's most troublesome and 
restrictive requirements in exchange for setting their own higher, more honest 
standards for student success. 



President Barack Obama, with Secretary of Education Arne Duncan, delivers remarks on education reform and the 
[...] White House, Feb. 9, 2012. (Official White House Photo by Pete Souza)

Explaining that our kids can’t wait any longer for Congress to act, President Barack Obama announced
today that ten states that have agreed to implement bold education reforms will receive waivers from the 
burdensome mandates of the federal education law known as No Child Left Behind. These waivers will 
give states the flexibility needed to raise student achievement standards, improve school accountability, 
and increase teacher effectiveness. The ten states approved for flexibility are Colorado, Florida, Georgia, 
Indiana, Kentucky, Massachusetts, Minnesota, New Jersey, Oklahoma, and Tennessee. (UPDATE: An 
eleventh state, New Mexico, was also approved for a waiver shortly after the initial announcement). 

So what does all this mean for our schools? What’s the problem with No Child Left Behind? What’s a 
waiver anyway, and why do states need flexibility? To answer these questions, we’ve put together a 
quick primer to help you understand the details behind today’s announcement. 

WHAT’S THE DEAL WITH NO CHILD LEFT BEHIND? 

No Child Left Behind, the most current version of the Elementary and Secondary Education Act, was 
signed into law in 2001—and is five years overdue to be re-written by Congress. The law’s objective was 
admirable. It shined light on achievement gaps and increased accountability at the school level for
high-need students. And there’s no question that setting goals and holding schools accountable for meeting
them is central to an education system that prepares students to compete in a global, 21st century
economy. 

As written, however, No Child Left Behind has serious flaws. In fact, some of the law’s requirements are 
actually stifling the kind of reforms we need to really improve student achievement, teacher effectiveness, 
and school accountability. For example, it determines whether schools are falling behind based on test 
scores. It imposes punitive labels and prescribes one-size-fits-all federal mandates for fixing failing 
schools. It’s led states to narrow curriculum to focus more on teaching to the test and less on teaching 
everything else students need to know, and to lower standards to make them easier to meet.

The Obama administration has worked extensively with Congress to re-write the law, and even submitted 
its own blueprint for education reform in March 2010, but legislators have not moved forward. 

WHAT ARE WAIVERS AND WHAT DO THEY HAVE TO DO WITH NO CHILD LEFT BEHIND? 

Waivers provide an opportunity to fix what’s wrong with No Child Left Behind without waiting any longer 
for Congress to act. States receiving waivers are given flexibility that exempts them from meeting the
law’s most troublesome and restrictive requirements in exchange for setting their own higher, more 
honest standards for student success. 

For example, waivers will give states the flexibility to: 

• Set their own ambitious but achievable terms for closing achievement gaps and ensuring students 
[...] proficiency by 2014. Kentucky, for example, has set a goal to cut the number of underperforming
students in half over the next five years. 

• Design their own strategies to improve their lowest-performing schools and measure student 
progress year over year, instead of relying on absolute numbers and a federally prescribed, “one 
size fits all” approach. Colorado, for example, another state receiving a waiver, is launching a 
website that will allow teachers and parents to see exactly how much progress students are
making, and how different schools measure up. 

WHY DO STATES NEED FLEXIBILITY? 

States need the flexibility to move forward with innovative education reforms they design themselves — 
rather than a federal mandate—without sacrificing high standards or lowering accountability. After all, 
what works for Kentucky doesn’t necessarily work for New Jersey, and the parents and educators who 
live and work in each place are best-positioned to know the needs of their own communities. 

There is still no clear bipartisan path in Congress for ESEA reauthorization - and we can’t wait any 
longer. Schools and districts continue their daily work of educating students, while also planning for next 
school year, and states need this flexibility now to implement plans for reform and improvement. Today’s 
announcement continues a process the President announced last September. 

The fact is, most states are already pursuing reforms that go above and beyond the requirements in No 
Child Left Behind, and waivers will help them continue that progress. More than 40 states have adopted 
common standards that define what it means to be college and career ready, just as many have designed 
assessments to measure student progress toward achieving those standards. States have reformed 
teacher and principal evaluations to better determine which ones are effective and which ones aren’t, and 
developed support systems to help the less effective ones improve. 

HOW DID THESE STATES QUALIFY FOR WAIVERS? 

President Obama offered every state a deal: If you’re willing to set higher, more honest standards based 
on a clear goal that every student can graduate ready for college or a career, we’ll give you the flexibility 
to meet those standards. 

In addition to setting new performance targets for student achievement, states had to prove that they 
were serious by developing a plan addressing three critical areas: 

• Preparing students for college and careers: States must have already adopted college- and 
career-ready standards in reading and math that raise the achievement of all students, including 
English language learners and students with disabilities. Additionally, states must create a plan to
help schools and districts implement those standards and administer statewide tests to measure 
progress. 

• Holding schools accountable for making progress: States must establish an accountability system
[...] significant gains in improving student achievement. And they must develop targeted strategies to turn
around the lowest performing schools and help groups of students with the greatest needs. 

• Improving teacher and principal effectiveness: States must set guidelines for teacher and 
principal evaluation and support systems, developed with input from educators and principals. 
Evaluation systems should assess performance using factors beyond test scores—such as principal 
observation, peer review, student work, or parent and student feedback—and provide teachers with 
both constructive advice for improving and support in doing so. 

WHAT’S NEXT? 

Just as the administration worked extensively with Congress to try to re-write No Child Left Behind before
announcing last September that it would offer states flexibility waivers, President Obama will continue to 
call on Congress to reform the law while offering states that are willing to set higher standards for their 
students the chance to do so. 

In fact, in addition to the 10 states that requested the flexibility to implement reforms through this initial 
round of waivers, an 11th application is still being revised and reviewed, and 28 other states along with 
Puerto Rico and the District of Columbia have also expressed interest in receiving waivers. 

As President Obama explained this afternoon, “if we’re serious about helping our children reach their 
potential, the best ideas aren’t going to come from Washington alone. Our job is to harness those ideas, 
and to hold states and schools accountable for making them work.” 


Update: On May 29, 2012 the U.S. Department of Education granted waivers to an additional eight 
states: Connecticut, Delaware, Louisiana, Maryland, New York, North Carolina, Ohio, and Rhode Island, 
which brings the total number of states to receive waivers to 19, with an additional 18 applications still 
under review. 



Megan Slack 

Former Deputy Director of Digital Content for the Office of Digital Strategy 

National resolution against high-stakes tests released 



By Valerie Strauss April 24, 2012

A national resolution protesting high-stakes standardized testing was
released Tuesday by a coalition of national education, civil rights and 
parents groups, as well as educators who are trying to build a broad-based 
movement against the Obama administration’s test-centric school reform 
program. 



This is the latest in a series of recent initiatives taken around the country by 
academics, educators, parents and others to protest the use of student 
standardized test scores for high-stakes decisions, including teacher and
principal evaluation, student grade promotion and high school graduation.


The high-stakes testing era started with the advent of No Child Left Behind in
2002, and though NCLB has largely been discredited, the Obama
administration’s policies have expanded the use of test scores as assessment 
tools not only for students, but also for teachers and principals. 




Many researchers in the assessment field have warned against using 
standardized test scores for high-stakes decisions, saying they are unreliable 
for such a purpose. High-stakes standardized testing, they say, has led to the 
narrowing of the curriculum; classrooms where “teaching to the test” is 
paramount; and unfair evaluation of students, teachers, principals and 
schools. 

The resolution (see text below) is modeled on one passed in recent months by 
[...] commissioner, Robert Scott, made news in February by saying the mentality
that standardized testing is the “end-all, be-all” is a “perversion” of what a
quality education should be, and calling “the assessment and accountability
regime” not only “a cottage industry but a military-industrial complex.”


The organizers want organizations and individuals to endorse the resolution, 
which asks officials in every state to “reexamine public school accountability 
systems” and to “develop a system based on multiple forms of assessment 
which does not require extensive standardized testing” and “more accurately 
reflects the broad range of student learning.” 




The resolution also calls on Congress and the Obama administration to 
rewrite the Elementary and Secondary Education Act, the federal education 
law known in its current form as No Child Left Behind, in a way that reduces 
the mandate for standardized tests, promotes multiple forms of evidence
that students are learning and does not mandate that student test scores be 
used to evaluate educators. 


“Parents are fed up with constant testing,” Pamela Grundy, of Parents Across 
America, was quoted as saying in a statement. She helped lead a community 
revolt against expanding testing in Charlotte, N.C., last year. “We want our 
elected leaders to support real learning, not endless evaluation,” she said. 

The national resolution was written by Advancement Project; Asian 
American Legal Defense and Education Fund; FairTest; Forum for 
Education and Democracy; MecklenburgACTS; Deborah Meier; NAACP 
Legal Defense and Educational Fund, Inc.; National Education Association; 
[...] Across America; Parents United for Responsible Education - Chicago; Diane
Ravitch; Race to Nowhere; Time Out From Testing; and United Church of 
Christ Justice and Witness Ministries. 


Already a number of other organizations and individuals from around the 
country have signed on to the resolution. 

In recent months, protests by parents and educators have been increasing in 
a number of states in addition to Texas, including New York, California and 
Illinois. This resolution is an effort to make a national statement about the 
dangers of high-stakes testing that gets the attention of policy makers at the 
state and federal levels. 

Here’s the text of the national resolution: 

WHEREAS, our nation’s future well-being relies on a high-quality public 
education system that prepares all students for college, careers, citizenship 
and lifelong learning, and strengthens the nation’s social and economic 
well-being; and 

WHEREAS, our nation’s school systems have been spending growing 
amounts of time, money and energy on high-stakes standardized testing, in 
which student performance on standardized tests is used to make major 
decisions affecting individual students, educators and schools; and 

WHEREAS, the over-reliance on high-stakes standardized testing in state 
and federal accountability systems is undermining educational quality and 
equity in U.S. public schools by hampering educators’ efforts to focus on the 
broad range of learning experiences that promote the innovation, 
creativity, problem solving, collaboration, communication, critical 
thinking and deep subject-matter knowledge that will allow students to 
thrive in a democracy and an increasingly global society and economy; 
and 

WHEREAS, it is widely recognized that standardized testing is an 
inadequate and often unreliable measure of both student learning and 
educator effectiveness; and 

WHEREAS, the over-emphasis on standardized testing has caused 
[...] the curriculum, teaching to the test, reducing love of learning, pushing
students out of school, driving excellent teachers out of the profession, and 
undermining school climate; and 

WHEREAS, high-stakes standardized testing has negative effects for 
students from all backgrounds, and especially for low-income students,
English language learners, children of color, and those with disabilities;
and 

WHEREAS, the culture and structure of the systems in which students learn 
must change in order to foster engaging school experiences that promote 
joy in learning, depth of thought and breadth of knowledge for students; 
therefore be it 

RESOLVED, that [your organization name] calls on the governor, state 
legislature and state education boards and administrators to reexamine 
public school accountability systems in this state, and to develop a system 
based on multiple forms of assessment which does not require extensive 
standardized testing, more accurately reflects the broad range of student 
learning, and is used to support students and improve schools; and 

RESOLVED, that [your organization name] calls on the U.S. Congress and
Administration to overhaul the Elementary and Secondary Education Act, 
currently known as the “No Child Left Behind Act,” reduce the testing 
mandates, promote multiple forms of evidence of student learning and 
school quality in accountability, and not mandate any fixed role for the use 
of student test scores in evaluating educators. 